<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Toru Maesaka &#187; storage</title>
	<atom:link href="http://torum.net/tag/storage/feed/" rel="self" type="application/rss+xml" />
	<link>http://torum.net</link>
	<description>Hackaholic and a Web Addict based in Tokyo</description>
	<lastBuildDate>Sat, 01 Oct 2011 18:46:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>BlitzDB Crash Safety and Auto Recovery</title>
		<link>http://torum.net/2010/07/blitzdb-crash-safety-and-auto-recovery/</link>
		<comments>http://torum.net/2010/07/blitzdb-crash-safety-and-auto-recovery/#comments</comments>
		<pubDate>Thu, 22 Jul 2010 09:43:14 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[hacking]]></category>
		<category><![CDATA[recovery]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2369</guid>
		<description><![CDATA[Crash Safety is a big deal in the database league. Lack of durability can lead to all sorts of terrible things upon a catastrophic event. Many projects, especially in the so-called NoSQL world, compromise crash safety in return for higher QPS. The argument there is that the availability of the overall system should be [...]]]></description>
			<content:encoded><![CDATA[<p>Crash Safety is a big deal in the database league. Lack of durability can lead to all sorts of terrible things upon a catastrophic event. Many projects, especially in the so-called NoSQL world, compromise crash safety in return for higher QPS. The argument there is that the availability of the overall system should be accomplished by replication since a database server can&#8217;t be rescued if the physical disk breaks. I happen to agree with this philosophy but I am also aware that this isn&#8217;t the right answer for everyone. So, what will I do with BlitzDB?</p>
<p>Several relational database hackers have pointed out that BlitzDB isn&#8217;t any safer than MyISAM since it doesn&#8217;t guarantee crash safety. This is currently true, but I plan on making BlitzDB much safer than MyISAM by providing the following features.</p>
<ol>
<li>Auto Recovery Routine (startup option)</li>
<li>Tokyo Cabinet&#8217;s Transaction API (table-specific option)</li>
</ol>
<p>The second feature above would actually make BlitzDB crash safe (especially combined with auto recovery) but I won&#8217;t go into depth in this post since the topic deserves a blog post of its own. Let me just state that this feature will be provided in a form like this:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>
  a int <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">,</span>
  b varchar<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">256</span><span style="color: #66cc66;">&#41;</span>
<span style="color: #66cc66;">&#41;</span> ENGINE <span style="color: #66cc66;">=</span> BLITZDB<span style="color: #66cc66;">,</span> CRASH_SAFE;</pre></div></div>

<p>From here on, I&#8217;ll cover how I plan on hacking auto recovery in BlitzDB.</p>
<h3>Auto Recovery Challenges</h3>
<p>As I blogged a while back, <a href="http://torum.net/2010/01/how-to-recover-a-tokyo-cabinet-database-file/">recovering Tokyo Cabinet</a> is relatively simple. However, this is not a sufficient solution in BlitzDB since the data file (the hash database that actually holds the rows) and the index file(s) are independent of each other. That is, the data file and the index file(s) are very likely to be inconsistent after a crash. So, how can we hack on this? Pretty simple.</p>
<h3>Indexes aren&#8217;t Important at Recovery Phase</h3>
<p>Because BlitzDB logically separates the data file and its indexes, index files aren&#8217;t that important. If a server crash has occurred, BlitzDB can delete the index file(s) and recompute them from the data file. Needless to say, this process involves a lot of random access and computation, but it would not dominate the running time of the system since it&#8217;s a one-time cost. This approach has one flaw, however: the index files can&#8217;t be recomputed if the data file is broken or unrecoverable.</p>
<p>Therefore to guarantee crash safety, BlitzDB must ensure that the data file is unbreakable. This is precisely where Tokyo Cabinet&#8217;s Transaction API comes in. I&#8217;m planning on using it to protect the data file from breaking. If the data file is protected, the table can be rescued. Simple!</p>
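<p>To make the rebuild step concrete, here is a minimal sketch of recomputing an index from the data file. It is not BlitzDB&#8217;s code: the Row struct, the DataFile and IndexFile typedefs and rebuild_index() are hypothetical names, and in-memory std::map containers stand in for the Tokyo Cabinet files.</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">#include &lt;map&gt;
#include &lt;string&gt;
#include &lt;utility&gt;

// Hypothetical row layout, for illustration only.
struct Row {
  std::string city;     // the indexed column in this sketch
  std::string payload;  // everything else in the row
};

// Stand-ins for the data file (primary key -&gt; row) and for one
// index file (indexed value -&gt; primary key).
typedef std::map&lt;int, Row&gt; DataFile;
typedef std::multimap&lt;std::string, int&gt; IndexFile;

// Throw away whatever the index held and recompute it from the data file.
void rebuild_index(const DataFile &amp;data, IndexFile &amp;index) {
  index.clear();
  for (DataFile::const_iterator it = data.begin(); it != data.end(); ++it) {
    index.insert(std::make_pair(it-&gt;second.city, it-&gt;first));
  }
}</pre></div></div>

<p>The sketch only works because the data file is the single source of truth: any stale entry left in the index by a crash disappears with the clear() call, and nothing in the index is needed to reconstruct it.</p>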
<p>So, that&#8217;s what I have in mind for making BlitzDB a safer engine. Unfortunately I can&#8217;t start hacking on it immediately since I have several bugs to fix first. Nevertheless I&#8217;m looking forward to hacking on it. This challenge should be quite fun to tackle.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/07/blitzdb-crash-safety-and-auto-recovery/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Notes on HEAP/MyISAM Index Key Handling on WRITE</title>
		<link>http://torum.net/2010/01/notes-on-heap-myisam-key-generation/</link>
		<comments>http://torum.net/2010/01/notes-on-heap-myisam-key-generation/#comments</comments>
		<pubDate>Tue, 26 Jan 2010 08:57:05 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[knowledge]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2331</guid>
		<description><![CDATA[Disclaimer: This post is based on HEAP/MyISAM&#8217;s source code in Drizzle. Here are my brief notes on investigating how index keys are generated in HEAP and MyISAM. I lurked through these because I&#8217;ve started preparing for decent index support in BlitzDB. I also wrote this to assist my biological memory for later grepping (I have terrible [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Disclaimer: This post is based on HEAP/MyISAM&#8217;s source code in Drizzle.</strong></p>
<p>Here are my brief notes on investigating how index keys are generated in HEAP and MyISAM. I lurked through these because I&#8217;ve started preparing for decent index support in BlitzDB. I also wrote this to assist my biological memory for later grepping (I have terrible memory for names). I&#8217;m only going to cover key generation on write in this post. Otherwise this post is going to be massive.</p>
<h3>HEAP Engine</h3>
<p>The index structure of HEAP can be either BTREE or HASH (in MySQL doc terms). Like other engines, HEAP has a structure for keeping key definitions (parts, type, logic, etc.). This structure is called HP_KEYDEF and it contains function pointers for write, delete, and getting the length of the key. These function pointers are assigned at table creation or when the table is opened. The assigned function depends on the data structure of the index and is one of the following:</p>
<h4>BTREE</h4>
<ul>
<li>hp_rb_write_key()</li>
<li>hp_rb_delete_key()</li>
</ul>
<h4>HASH</h4>
<ul>
<li>hp_write_key()</li>
<li>hp_delete_key()</li>
</ul>
<p>As for get_key_length(), one of the following functions is used, regardless of the data structure.</p>
<ul>
<li>hp_rb_var_key_length()</li>
<li>hp_rb_null_key_length()</li>
<li>hp_rb_key_length()</li>
</ul>
<p>When writing a row to the tree, HEAP writes to the index using a key generated by hp_rb_make_key(). Note that it does not use this for the hash index. The generated key is populated inside &#8216;recbuffer&#8217; in HEAP&#8217;s handler object (HP_INFO structure).</p>
<p>From my understanding, it loops through the key segments (I suspect they are similar to the internal KEY_PART_INFO structure) and appropriately copies each key field value to the output buffer. By &#8220;appropriately&#8221; I mean that it respects the characteristics of the data type when packing the buffer. For example, for a variable length field, it will only copy the actual data and not the max possible size of it. The last thing copied to the buffer is the address of the chunk where the record lives.</p>
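<p>Here is a rough sketch of that packing loop, written from the description above rather than copied from HEAP. KeySegment, append_bytes() and make_key() are hypothetical names, and the 2-byte length prefix in front of variable length data is an assumption of this sketch, not something taken from hp_rb_make_key().</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">#include &lt;cstddef&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Hypothetical key segment: a fixed-width integer or a variable length string.
struct KeySegment {
  bool is_varlen;
  int int_value;          // used when is_varlen is false
  std::string str_value;  // used when is_varlen is true
};

// Append raw bytes to the end of the key buffer.
static void append_bytes(std::vector&lt;unsigned char&gt; &amp;key,
                         const void *src, std::size_t len) {
  const unsigned char *p = static_cast&lt;const unsigned char *&gt;(src);
  key.insert(key.end(), p, p + len);
}

// Pack every segment, then finish with the address of the record.
std::vector&lt;unsigned char&gt; make_key(const std::vector&lt;KeySegment&gt; &amp;segs,
                                    const void *record) {
  std::vector&lt;unsigned char&gt; key;
  for (std::size_t i = 0; i &lt; segs.size(); ++i) {
    if (segs[i].is_varlen) {
      // Variable length: assumed 2-byte length prefix, then only the
      // bytes actually in use (never the maximum column width).
      unsigned short len =
          static_cast&lt;unsigned short&gt;(segs[i].str_value.size());
      append_bytes(key, &amp;len, sizeof(len));
      append_bytes(key, segs[i].str_value.data(), len);
    } else {
      append_bytes(key, &amp;segs[i].int_value, sizeof(int));
    }
  }
  // The record address goes last, so the index entry points at the row.
  append_bytes(key, &amp;record, sizeof(record));
  return key;
}</pre></div></div>

<p>With one int segment and one string segment holding &#8220;abc&#8221;, the key is sizeof(int) + 2 + 3 + sizeof(void *) bytes long, no matter how wide the string column was declared.</p>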
<h3>MyISAM Engine</h3>
<p>The upper layer of key handling in MyISAM looks somewhat similar to HEAP so you can really tell that it was written by the same people. Things are nicely wrapped together by the MYISAM_SHARE structure so it&#8217;s relatively easy to follow. BlitzDB has a class called BlitzShare for the same purpose (This is based off Archive Engine&#8217;s ArchiveShare class).</p>
<p>Like HEAP, MyISAM has a structure for individual key definition called MI_KEYDEF (it&#8217;s defined in myisam.h). There are more function pointers in this structure than HEAP.</p>
<ul>
<li>bin_search()</li>
<li>get_key()</li>
<li>pack_key()</li>
<li>store_key()</li>
<li>ck_insert()</li>
<li>ck_delete()</li>
</ul>
<p>In Drizzle, _mi_ck_write() is assigned to ck_insert(), which is the entry point to writing a MyISAM index. The key that MyISAM uses to write to the index is generated by _mi_make_key(). Like HEAP, it will loop through the key segments and pack the relevant fields according to the characteristics of the data type. The output buffer belongs to MyISAM&#8217;s handler (lastkey2).</p>
<h3>From Here</h3>
<p>I&#8217;ve actually written a naive key generator for BlitzDB already based on Drizzle/MySQL&#8217;s internal KEY_PART_INFO array. It seems to be working on EXACT MATCH but I still need to implement an index scanner which looks much harder to pull off than a table scanner. What I&#8217;m really worried about is supporting composite indexes (namely reading/searching on it) but hopefully I&#8217;ll understand how this area of the storage system works soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/01/notes-on-heap-myisam-key-generation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>End of Year Progress on BlitzDB</title>
		<link>http://torum.net/2009/12/end-of-year-progress-on-blitzdb/</link>
		<comments>http://torum.net/2009/12/end-of-year-progress-on-blitzdb/#comments</comments>
		<pubDate>Thu, 24 Dec 2009 09:47:06 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2318</guid>
		<description><![CDATA[FURTHER UPDATE: Further thoughts on BlitzDB’s Index Handling My open source friends might have noticed that I&#8217;ve been working quite a bit on BlitzDB lately. To tell the truth, I had a hidden goal to get Version-1 done by Christmas. Unfortunately it doesn&#8217;t look like I can reach that goal. However, looking at the bright side [...]]]></description>
			<content:encoded><![CDATA[<p><strong>FURTHER UPDATE:</strong> <a href="http://torum.net/2010/01/further-thoughts-on-blitzdb-index/">Further thoughts on BlitzDB’s Index Handling</a></p>
<p>My open source friends might have noticed that I&#8217;ve been working quite a bit on BlitzDB lately. To tell the truth, I had a hidden goal to get Version-1 done by Christmas. Unfortunately it doesn&#8217;t look like I can reach that goal. However, looking at the bright side, I got a lot done in the past few weeks, so allow me to &#8220;journal&#8221; it in this blog post.</p>
<h3>Agony of Knowing</h3>
<p>The more I understood Drizzle&#8217;s storage mechanism and Tokyo Cabinet&#8217;s internals, the more I disliked what I previously had. This led me to spending quite a bit of time rewriting BlitzDB&#8217;s codebase. I was using pthread&#8217;s rwlock for concurrency control but I decided to design and write <a href="http://torum.net/2009/11/blitzdb-and-tc-concurrency-model/">BlitzDB&#8217;s own lock mechanism</a> to get the best out of TC (in terms of concurrency). I also rewrote the entire table scan code which is something you&#8217;d hope won&#8217;t be executed that often (people should use indexes!) but needless to say, it&#8217;s an important component of a relational storage engine so I&#8217;ve put in a lot of effort there.</p>
<h3>Rewriting the Table Scanner</h3>
<p>In the process of rewriting the table scanner, Jay Pipes gave me some fantastic advice on using Drizzle&#8217;s internal atomic type (drizzled::atomics). He gave me this advice because he noticed that my atomic ID generator was securing atomicity with pthread&#8217;s mutex. One could argue that the mutex was only held for a few CPU instructions, but the philosophy of using the most efficient method on the platform where BlitzDB is to be run was appealing enough for me to use drizzled::atomics. Mikio did some experiments on this and found that in a competitive/congested environment, using the compiler&#8217;s builtin function can gain you <a href="http://1978th.net/tech/promenade.cgi?id=68">3x throughput</a>.</p>
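<p>For readers who want to see the difference, here is a small sketch of the two styles of ID generator. It uses standard C++ (std::mutex and std::atomic) as stand-ins for pthread&#8217;s mutex and drizzled::atomics, so the class names and the API shown are assumptions of this sketch and not BlitzDB&#8217;s or Drizzle&#8217;s actual code.</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;mutex&gt;

// Mutex-based generator: correct, but every caller has to take a lock.
class MutexIdGenerator {
public:
  MutexIdGenerator() : next_(0) {}
  std::uint64_t next() {
    std::lock_guard&lt;std::mutex&gt; guard(lock_);
    return next_++;
  }
private:
  std::mutex lock_;
  std::uint64_t next_;
};

// Atomic generator: a single atomic fetch-and-add, no lock involved.
class AtomicIdGenerator {
public:
  AtomicIdGenerator() : next_(0) {}
  std::uint64_t next() { return next_.fetch_add(1); }
private:
  std::atomic&lt;std::uint64_t&gt; next_;
};</pre></div></div>

<p>Both hand out 0, 1, 2 and so on without duplicates. The atomic version simply leaves it to the platform to pick the cheapest way of doing the increment, which is the philosophy described above.</p>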
<h3>Hacking on Index Support</h3>
<p>I&#8217;ve finally started hacking on index support and I just finished supporting basic operations on a primary key. By design, BlitzDB&#8217;s index is a dense <a href="http://en.wikipedia.org/wiki/Index_(database)#Clustered">clustered</a> b+tree but in the first release I am going to limit PK to only be a HASH index. This is because I want BlitzDB to treat all PKs as direct keys inside the data dictionary (hash database where the actual rows are stored). So in other words, I want people to use PK for &#8220;needle in a haystack&#8221; like queries only. An example of a needle in a haystack like query is:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #66cc66;">*</span> <span style="color: #993333; font-weight: bold;">FROM</span> table_name <span style="color: #993333; font-weight: bold;">WHERE</span> primary_key_column <span style="color: #66cc66;">=</span> whatever;</pre></div></div>

<p>Having said that, I don&#8217;t like to force people to do things the way I like, so I plan on providing the best of both worlds by supporting both data structures for PKs in Version-2:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>id int<span style="color: #66cc66;">,</span> <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">USING</span> btree<span style="color: #66cc66;">&#41;</span> ENGINE<span style="color: #66cc66;">=</span>blitzdb;
<span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>id int<span style="color: #66cc66;">,</span> <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">USING</span> hash<span style="color: #66cc66;">&#41;</span> ENGINE<span style="color: #66cc66;">=</span>blitzdb;</pre></div></div>

<p>BlitzDB&#8217;s default configuration will use PK as a &#8220;direct&#8221; data dictionary index. If you wish to do range queries on PK, the solution is to create an index on the PK column.</p>
<h3>Primary Key lookup Performance</h3>
<p>So, how does my implementation perform? Here&#8217;s a quick benchmark with a test-run that randomly fetches 100 thousand rows from a BlitzDB table with 1 million rows. This is the table I used:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>id int <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">,</span> a int<span style="color: #66cc66;">,</span> b int<span style="color: #66cc66;">&#41;</span> ENGINE<span style="color: #66cc66;">=</span>blitzdb;</pre></div></div>

<p>and the query looks like this:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #66cc66;">*</span> <span style="color: #993333; font-weight: bold;">FROM</span> t1 <span style="color: #993333; font-weight: bold;">WHERE</span> id <span style="color: #66cc66;">=</span> random_number_under_one_million;</pre></div></div>

<p>The hardware I used is the following commodity server: Intel Quad Xeon E5345 (2x4MB L2 cache), 8GB Memory, 500GB SATA II. Unfortunately I could not prepare a standalone client server today so both the server and the test program were run on the same machine. Yeah&#8230; this sucks so I can&#8217;t claim that this benchmark is 100% credible.</p>
<p>Here is the result I obtained from <a href="http://code.google.com/p/skyload">skyload</a>. Please only view it as a guideline to BlitzDB&#8217;s lookup performance. I&#8217;ll do a proper benchmark with the Drizzle Community and publish it after I get Version-1 released.</p>

<div class="wp_syntax"><div class="code"><pre class="null" style="font-family:monospace;">[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 1
  Task Completion Time   : 5.88856 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
&nbsp;
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 2
  Task Completion Time   : 6.94474 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
&nbsp;
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 4
  Task Completion Time   : 7.04455 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1</pre></div></div>

<p>As you can see, &#8220;needle in a haystack&#8221; queries can be executed pretty efficiently in BlitzDB. Looking at the first result, we can observe that it took an average of roughly 0.059 milliseconds to process a query.</p>
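<p>The per-query figure is just the task completion time divided by the number of queries. Here is a tiny sketch of that arithmetic. The function names are made up for this post, and it assumes that &#8220;Number of Queries&#8221; in the skyload output is the total for the whole run.</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">// Average time spent per query, in milliseconds.
double avg_latency_ms(double total_seconds, unsigned long queries) {
  return total_seconds * 1000.0 / static_cast&lt;double&gt;(queries);
}

// Queries completed per second over the whole run.
double queries_per_second(double total_seconds, unsigned long queries) {
  return static_cast&lt;double&gt;(queries) / total_seconds;
}</pre></div></div>

<p>For the single connection run, avg_latency_ms(5.88856, 100000) comes out at about 0.0589 milliseconds, which works out to roughly 17,000 queries per second.</p>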
<h3>Future Plans</h3>
<p>Admittedly, primary key support isn&#8217;t completely done so I&#8217;ll continue working on it. After that, I will start hacking on b+tree indexes and write more tests as I go. Once I support at least two indexes, I&#8217;ll ask the Drizzle Community to consider merging BlitzDB into Drizzle&#8217;s trunk. This is my goal for BlitzDB at the moment.</p>
<p>I also happen to own blitzdb.com so I&#8217;m planning on putting user documentation (including a tutorial) and architectural notes there. This is currently not so high on my TODO list so I suspect it won&#8217;t happen until I get Version-1 released. All I can say about the release schedule at the moment is, &#8220;before the MySQL conference in April&#8221;.</p>
<p>So, that&#8217;s all I have to summarize for now. Thanks for reading this far. Merry Christmas and have a Happy New Year. Don&#8217;t trip on ice :)</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/12/end-of-year-progress-on-blitzdb/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Drizzle Storage Engine Dev: Determining Query Type</title>
		<link>http://torum.net/2009/11/drizzle-engine-query-type/</link>
		<comments>http://torum.net/2009/11/drizzle-engine-query-type/#comments</comments>
		<pubDate>Wed, 25 Nov 2009 10:42:58 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2310</guid>
		<description><![CDATA[Determining what kind of SQL query is requested at the handler level is pretty important for BlitzDB since the strategy is to obtain the most suitable lock for a given request. Unfortunately there is no intuitive way to get this information. So, I took a peek into InnoDB&#8217;s source code and found my solution (open source [...]]]></description>
			<content:encoded><![CDATA[<p>Determining what kind of SQL query is requested at the handler level is pretty important for BlitzDB since the strategy is to obtain the most suitable lock for a given request. Unfortunately there is no intuitive way to get this information. So, I took a peek into InnoDB&#8217;s source code and found my solution (open source saves the day as usual).</p>
<h3>Solution</h3>
<p>In Drizzle, there is a function called <strong>session_sql_command(Session *session)</strong> which returns an integer that corresponds to one of the command type constants (which are accessible from the engine). Ideally I would like to call this function from anywhere in the engine but since it requires a session object as an argument, I could only call it from store_lock().</p>
<p>My solution was to add a variable in the handler class and assign the appropriate value to it from store_lock(). This turned out to be okay since store_lock() is called before any other API functions but the concern here is that store_lock() is planned for removal in the future.</p>
<p>Now I can do things like:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">int</span> ha_blitz<span style="color: #339933;">::</span><span style="color: #202020;">rnd_init</span><span style="color: #009900;">&#40;</span>bool drizzled_will_scan<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* sql_command_type was assigned earlier, in store_lock() */</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>sql_command_type <span style="color: #339933;">==</span> SQLCOM_UPDATE<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #808080; font-style: italic;">/* get the most suitable lock type for an update */</span>
  <span style="color: #009900;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>sql_command_type <span style="color: #339933;">==</span> SQLCOM_SELECT<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #808080; font-style: italic;">/* get the most suitable lock type for a read */</span>
  <span style="color: #009900;">&#125;</span>
  ...
<span style="color: #009900;">&#125;</span></pre></div></div>

<h3>Personal Request to Drizzle</h3>
<p>Although I would like to see store_lock() disappear from the storage engine API, I would like storage engines (technically worker threads) to have the ability to gather meta information on the query before any real work is done.</p>
<p>My request is for store_lock() to become something along the lines of <strong>gather_information()</strong>, where it gives the handler (or worker threads) a chance to gather information about the query. Needless to say, drizzled must call this function before any other API calls are made.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/11/drizzle-engine-query-type/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Drizzle Storage Engine Alias!</title>
		<link>http://torum.net/2009/11/drizzle-storage-engine-alias/</link>
		<comments>http://torum.net/2009/11/drizzle-storage-engine-alias/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 03:34:20 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2302</guid>
		<description><![CDATA[Admittedly I&#8217;m a lazy guy. Due to this nature I&#8217;m a little behind on the updates made to the Drizzle Storage Engine API. From lurking through the source code in the Drizzle trunk, I&#8217;ve noticed these changes. Engines can now have an alias handler class is replaced by the Cursor class Engine handler is now [...]]]></description>
			<content:encoded><![CDATA[<p>Admittedly I&#8217;m a lazy guy. Due to this nature I&#8217;m a little behind on the updates made to the Drizzle Storage Engine API. From lurking through the source code in the Drizzle trunk, I&#8217;ve noticed these changes.</p>
<ul>
<li>Engines can now have an alias</li>
<li>handler class is replaced by the Cursor class</li>
<li>Engine handler is now a subclass of Cursor</li>
<li>Table definitions are handled/stored by the storage engine</li>
<li>doCreateTable(), doDropTable(), doGetTableNames(), doGetTableDefinition()</li>
</ul>
<p>Currently I&#8217;m trying to catch up with the updated Drizzle Storage API and taking this opportunity to rewrite most of BlitzDB. The reason is that the more I understand TC&#8217;s internals, the more mistakes I realize I&#8217;ve made. I&#8217;ll blog more about this soon. For now, I&#8217;m going to introduce something small but nice.</p>
<p>A while back I suggested to folks like <a href="http://inaugust.com/">Monty Taylor</a> and <a href="http://www.flamingspork.com/blog/">Stewart Smith</a> that it would be cool if engines could have an alias. I mentioned this because <a href="http://www.innodb.com">InnoDB</a> was allowed to use both &#8220;innodb&#8221; and &#8220;innobase&#8221; in the system whereas other engines could only have one name. Another reason I was interested in this issue was that I couldn&#8217;t understand how InnoDB could use two names since there was no way to do this in the old interface. Turns out InnoDB was treated specially in the core, which is obviously not desirable in a microkernel philosophy.</p>
<p>In the current Drizzle Storage Engine API, there is a function called addAlias(). By calling this function inside the storage engine constructor, you can allow your engine to have multiple aliases. As an experiment I added this to BlitzDB:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">BlitzEngine<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">string</span> <span style="color: #339933;">&amp;</span>name_arg<span style="color: #009900;">&#41;</span>
  <span style="color: #339933;">:</span> drizzled<span style="color: #339933;">::</span><span style="color: #202020;">plugin</span><span style="color: #339933;">::</span><span style="color: #202020;">StorageEngine</span><span style="color: #009900;">&#40;</span>name_arg<span style="color: #339933;">,</span> HTON_CAN_RECREATE<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  table_definition_ext <span style="color: #339933;">=</span> drizzled<span style="color: #339933;">::</span><span style="color: #202020;">plugin</span><span style="color: #339933;">::</span><span style="color: #202020;">DEFAULT_DEFINITION_FILE_EXT</span><span style="color: #339933;">;</span>
  addAlias<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;BLITZ&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  addAlias<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;TCDB&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Here, I added aliases BLITZ and TCDB. TCDB is my way of showing respect to <a href="http://1978th.net/">Mikio</a> and <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a>. So given the above, we should now be able to create tables with three names.</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;">drizzle<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>foo int<span style="color: #66cc66;">&#41;</span> engine<span style="color: #66cc66;">=</span>blitzdb;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">0</span> sec<span style="color: #66cc66;">&#41;</span>
&nbsp;
drizzle<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t2 <span style="color: #66cc66;">&#40;</span>foo int<span style="color: #66cc66;">&#41;</span> engine<span style="color: #66cc66;">=</span>blitz;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">0</span> sec<span style="color: #66cc66;">&#41;</span>
&nbsp;
drizzle<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t3 <span style="color: #66cc66;">&#40;</span>foo int<span style="color: #66cc66;">&#41;</span> engine<span style="color: #66cc66;">=</span>tcdb;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">0.01</span> sec<span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>Success! Yep, this is something so trivial that I think most people wouldn&#8217;t care about but I was happy to see this update in the trunk :)</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/11/drizzle-storage-engine-alias/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BlitzDB Primary Key Based Insertion Performance</title>
		<link>http://torum.net/2009/07/blitzdb-primary-key-based-insertion-performance/</link>
		<comments>http://torum.net/2009/07/blitzdb-primary-key-based-insertion-performance/#comments</comments>
		<pubDate>Fri, 17 Jul 2009 02:16:52 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[tc]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2276</guid>
		<description><![CDATA[Like most things, I think storage engine development is about divide and conquer. The first sub-problem that I&#8217;m tackling with BlitzDB is squeezing as much juice out as possible from Tokyo Cabinet to achieve fast write performance. This by the way happens to be the primary reason that I wrote skyload. Writing skyload turned out [...]]]></description>
			<content:encoded><![CDATA[<p>Like most things, I think storage engine development is about divide and conquer. The first sub-problem that I&#8217;m tackling with <a href="https://launchpad.net/blitzdb">BlitzDB</a> is squeezing as much juice out as possible from <a href="http://tokyocabinet.sourceforge.net">Tokyo Cabinet</a> to achieve fast write performance. This by the way happens to be the primary reason that I wrote <a href="http://code.google.com/p/skyload">skyload</a>.</p>
<p>Writing skyload turned out to be worthwhile since it helped me find several critical bugs in the engine that only occurred under concurrent insertion load. Thanks to <a href="http://developer.cybozu.co.jp/kazuho/in_english/">Kazuho Oku</a> for helping me through the issues that I was facing.</p>
<p>I think I&#8217;ve now reached a stage where I can share how well BlitzDB can perform insertion from concurrent connections. But before moving ahead, I&#8217;d like to emphasize that for a real guideline, I believe that performance comparison should be done by an unbiased third party. So please don&#8217;t take the results in this post as the &#8220;truth&#8221;. Heh, I did write both the storage engine and the load emulator after all :)</p>
<p>So, with the above in mind, here&#8217;s a skyload result for inserting 100,000 rows under different concurrency levels with BlitzDB and <a href="http://dev.mysql.com/doc/refman/5.4/en/myisam-storage-engine.html">MyISAM</a> (both engines under default configuration).</p>
<p style="text-align: center;"><a href="http://www.flickr.com/photos/tmaesaka/3727653385/" title="Skyload Result - MyISAM and BlitzDB by tmaesaka, on Flickr"><img src="http://farm3.static.flickr.com/2556/3727653385_790b0a8b81_o.png" width="500" height="305" alt="Skyload Result - MyISAM and BlitzDB" /></a></p>
<p>The figures presented above are averages of 5 runs per concurrency level. Admittedly, an average of 5 runs is not enough to make my result credible, since the figures can easily be affected by dirty buffer flushes between the kernel and the filesystem (<a href="http://en.wikipedia.org/wiki/Ext3">ext3</a> in this particular benchmark). To address this, I plan on extending skyload to perform multiple runs of an identical test and compute the median and average.</p>
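<p>As a rough sketch of what that summary step might look like (this is not skyload code, and the function names <code>mean_of</code> and <code>median_of</code> are made up for illustration), the average and median of a handful of run times could be computed like this:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">#include &lt;assert.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* Comparison callback for qsort(): orders doubles ascending. */
static int cmp_double(const void *a, const void *b) {
  double x = *(const double *)a;
  double y = *(const double *)b;
  return (x &gt; y) - (x &lt; y);
}

/* Arithmetic mean of n run times (n must be greater than zero). */
double mean_of(const double *runs, size_t n) {
  assert(runs != NULL &amp;&amp; n &gt; 0);
  double sum = 0.0;
  for (size_t i = 0; i &lt; n; i++)
    sum += runs[i];
  return sum / (double)n;
}

/* Median of n run times. Sorts a private copy so the caller's
   array is left untouched. Returns -1.0 if memory runs out. */
double median_of(const double *runs, size_t n) {
  assert(runs != NULL &amp;&amp; n &gt; 0);
  double *sorted = malloc(n * sizeof(*sorted));
  if (sorted == NULL)
    return -1.0;
  memcpy(sorted, runs, n * sizeof(*sorted));
  qsort(sorted, n, sizeof(*sorted), cmp_double);
  double median = (n % 2 == 1)
                      ? sorted[n / 2]
                      : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
  free(sorted);
  return median;
}</pre></div></div>

<p>With five hypothetical runs of 1, 2, 3, 4 and 100 seconds (one of them hit by a flush), the average is 22 while the median stays at 3, which is why reporting both is useful.</p>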
<p>For those that are interested, this is what the test table looks like:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>
    id int <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">,</span>
    col1 int<span style="color: #66cc66;">,</span>
    col2 double<span style="color: #66cc66;">,</span>
    col3 varchar<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">255</span><span style="color: #66cc66;">&#41;</span>
<span style="color: #66cc66;">&#41;</span> ENGINE<span style="color: #66cc66;">=</span>blitz;</pre></div></div>

<p>I didn&#8217;t bother benchmarking anything beyond 32 connections since I ran both the client and the server on the same quad core machine (there&#8217;s no point). This is probably why you can see a nice curve up to four concurrent connections with BlitzDB in the graph. Yet another reason why you should not believe everything I&#8217;ve provided in this entry.</p>
<h3>BlitzDB needs you</h3>
<p>BlitzDB is still very early in its making and I still have an insane amount of work to do. For example, BlitzDB currently requires you to supply a primary key on your table. I plan on removing this limitation by generating a &#8220;fake&#8221; primary key internally but I haven&#8217;t got around to it yet.</p>
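<p>To give an idea of what a generated key could look like, here is a minimal sketch under my own assumptions. It is not BlitzDB code: <code>hidden_key_state</code> and <code>next_hidden_key()</code> are hypothetical names, and a real engine would also have to persist the counter across restarts and protect it from concurrent writers.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define HIDDEN_KEY_LEN 8

/* Hypothetical per-table state holding the last generated key. */
typedef struct {
  uint64_t last_id;
} hidden_key_state;

/* Writes the next hidden key into out[HIDDEN_KEY_LEN] in big-endian
   order so that byte-wise comparison matches numeric order.
   Not thread-safe: a real engine would need a lock or an atomic. */
void next_hidden_key(hidden_key_state *state, unsigned char *out) {
  assert(state != NULL &amp;&amp; out != NULL);
  uint64_t id = ++state-&gt;last_id;
  for (int i = HIDDEN_KEY_LEN - 1; i &gt;= 0; i--) {
    out[i] = (unsigned char)(id &amp; 0xff);
    id &gt;&gt;= 8;
  }
}</pre></div></div>

<p>Each call hands back a fresh 8-byte key, so rows inserted without a user-supplied primary key would still get a unique identifier.</p>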
<p>Support for multiple indexes is not done yet despite having all the necessary components to achieve it. I could do all this on my own but I prefer not to. I&#8217;m totally open to ideas and contributors. If you&#8217;re interested in this storage engine project, please don&#8217;t hesitate to ping me (dev @ this domain) or the <a href="https://launchpad.net/~drizzle-discuss">Drizzle community</a>. The more eyeballs the merrier :)</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/07/blitzdb-primary-key-based-insertion-performance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes on changes made to the Drizzle Storage Subsystem</title>
		<link>http://torum.net/2009/07/changes-to-the-drizzle-storage-subsystem/</link>
		<comments>http://torum.net/2009/07/changes-to-the-drizzle-storage-subsystem/#comments</comments>
		<pubDate>Thu, 09 Jul 2009 09:26:43 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2274</guid>
		<description><![CDATA[Yesterday I merged the BlitzDB tree with Drizzle&#8217;s trunk for the first time in a long time (yeah&#8230;) and discovered some interesting changes made to the storage subsystem while I was away. Previously all functions that caused an action on the storage engine were members of the handler class but various things like table [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I merged the <a href="https://code.launchpad.net/~tmaesaka/blitzdb/trunk">BlitzDB tree</a> with <a href="https://launchpad.net/drizzle">Drizzle</a>&#8217;s trunk for the first time in a long time (yeah&#8230;) and discovered some interesting changes made to the storage subsystem while I was away.</p>
<p>Previously all functions that caused an action on the storage engine were members of the handler class, but various things like table creation and transaction-related functions have now moved to the StorageEngine class. These changes are somewhat drastic but make good sense for Drizzle&#8217;s growth, since they make the subsystem easier to understand and free Drizzle from an interface design that was strongly influenced by MyISAM. For those that are interested, the StorageEngine class is located in &#8220;drizzled/plugin/storage_engine.h&#8221;.</p>
<p>For me it was pretty easy to update BlitzDB to work with the new subsystem since I don&#8217;t have anything special in the engine that required me to use my brain. I only had to move <a href="http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine#bas_ext">bas_ext()</a> and the table creation and rename functions over to the StorageEngine class and adjust them to the new interface:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">int</span> createTableImpl<span style="color: #009900;">&#40;</span>Session <span style="color: #339933;">*</span>session<span style="color: #339933;">,</span> <span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span>table_name<span style="color: #339933;">,</span> 
                    Table <span style="color: #339933;">*</span>table_arg<span style="color: #339933;">,</span> HA_CREATE_INFO <span style="color: #339933;">*</span>ha_create_info<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
&nbsp;
<span style="color: #993333;">int</span> renameTableImpl<span style="color: #009900;">&#40;</span>Session <span style="color: #339933;">*</span>session<span style="color: #339933;">,</span> <span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span>from<span style="color: #339933;">,</span> <span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span>to<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>For a real example, I recommend comparing the old InnobaseEngine class declaration with the updated one. As for where this redesign is going, this is the answer I got on the Drizzle channel from <a href="http://www.flamingspork.com/blog/">Stewart</a> who did the actual work for all this.</p>
<blockquote><p>stewart: tmaesaka: the basic idea is that handler becomes a cursor. the StorageEngine is for actions on the engine.<br />
stewart: tmaesaka: and handler is a cursor on a table.</p></blockquote>
<p>Something to keep in mind if you&#8217;re thinking about creating or porting a storage engine to Drizzle :)</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/07/changes-to-the-drizzle-storage-subsystem/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Storage Engine Dev Journal #3 : Supporting variable width tables</title>
		<link>http://torum.net/2009/06/supporting-variable-width-tables/</link>
		<comments>http://torum.net/2009/06/supporting-variable-width-tables/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 12:14:52 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[knowledge]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[engine]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2165</guid>
		<description><![CDATA[Something I&#8217;ve added to BlitzDB recently that was pretty high on my todo list is support for variable width tables. So what is a variable width table? It is a table that contains columns that can vary in size, namely BLOB and TEXT types. Going back to the basics, when a new row is to [...]]]></description>
			<content:encoded><![CDATA[<p>Something I&#8217;ve added to BlitzDB recently that was pretty high on my todo list is support for variable width tables. So what is a variable width table? It is a table that contains columns that can vary in size, namely <a href="http://dev.mysql.com/doc/refman/5.4/en/blob.html">BLOB and TEXT</a> types.</p>
<p>Going back to the basics, when a new row is to be written, a storage engine is given a pointer to the row data in MySQL format that it must somehow store for later lookup/retrieval. By &#8220;somehow&#8221;, I mean that the storage engine is given the freedom to do whatever it likes with the row.</p>
<p>Writing a row for a fixed length table (a table with columns that are always the same size) is dead easy. A storage engine can choose not to tamper with the row and simply write or copy the data to its storage mechanism. This is because the storage engine is given a row that contains all the data. Rows for variable width tables, however, are treated differently since things aren&#8217;t as simple (it&#8217;s variable!).</p>
<p>The difference is that columns for BLOB and TEXT types are represented by two parts inside a MySQL/Drizzle row:</p>
<ul>
<li>length of the data</li>
<li>pointer to the actual data</li>
</ul>
<p>This is simple to understand since we need to know the size of the data to copy it.</p>
<h4>Minor Complication</h4>
<p>The minor complication, as you would expect, is that you can&#8217;t directly write the provided row to your engine like you can with fixed length tables. The data that you want to copy/write exists elsewhere (hence the pointer), so directly writing the row is meaningless (the data the pointer refers to would be gone by your next access to that row). You need to make sure that the actual data for the BLOB/TEXT column(s) is arranged appropriately in your engine&#8217;s row buffer and written out to its storage mechanism.</p>
<p>This process is commonly referred to as row packing (converting to your engine format) and unpacking (converting back to MySQL format). So how is this done? It&#8217;s actually pretty simple!</p>
<h4>The solution is actually simple</h4>
<p>As much as it sounds like a bother to support variable length rows, it&#8217;s actually not that bad. First you need to understand what a MySQL row looks like internally.</p>
<p>A MySQL row begins with a bitset that represents which fields are NULL. The length of this data obviously depends on the number of NULLable columns you have, but this is easy to handle with Drizzle since we&#8217;re given all the relevant information by the TableShare object (the same goes for MySQL, though through a different object).</p>
<p>After this data comes the actual column data, in the order that it appears in your CREATE TABLE statement. What you need to do to get packing working with this row is the not-so-obvious part that you really need an example to look at. Fortunately, tweeting about this attracted <a href="http://twitter.com/brianaker/status/2026228307">Brian&#8217;s attention</a>, which helped me move forward.</p>
<h4>Loop the fields!</h4>
<p>So, let&#8217;s take row insertion to a variable width table as an example. Imagine this table:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>
  id int <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  description text<span style="color: #66cc66;">,</span>
  arbitrary_data blob
<span style="color: #66cc66;">&#41;</span> engine<span style="color: #66cc66;">=</span>your_engine;</pre></div></div>

<p>and let&#8217;s imagine that we need to process this query:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">INSERT</span> <span style="color: #993333; font-weight: bold;">INTO</span> t1 <span style="color: #993333; font-weight: bold;">VALUES</span> <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">,</span> <span style="color: #ff0000;">&quot;hello world&quot;</span><span style="color: #66cc66;">,</span> <span style="color: #ff0000;">&quot;blobbbbb&quot;</span><span style="color: #66cc66;">&#41;</span>;</pre></div></div>

<p>Now, the storage engine needs to &#8220;pack&#8221; the data for each column into its buffer in the <a href="http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine#Adding_Support_for_INSERT_to_a_Storage_Engine">write_row()</a> function. Conveniently, Drizzle/MySQL provides a pack() function for its column types (fields) that will do the data packing for you. That is, you do not have to inspect the provided row for pointers to the actual data and do the packing/copying yourself.</p>
<p>How? Well, the table object (which is visible from your engine) conveniently holds a list of fields in the appropriate order. The actual pack() function is a member of these fields so you just need to call it as you loop over the list:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">/* make sure row_buffer has enough memory */</span>
<span style="color: #993333;">unsigned</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span>pos <span style="color: #339933;">=</span> row_buffer<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">/* copy NULL bits, &quot;table-&gt;s&quot; is the TableShare object */</span>
memcpy<span style="color: #009900;">&#40;</span>pos<span style="color: #339933;">,</span> row<span style="color: #339933;">,</span> table<span style="color: #339933;">-&gt;</span>s<span style="color: #339933;">-&gt;</span>null_bytes<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
pos <span style="color: #339933;">+=</span> table<span style="color: #339933;">-&gt;</span>s<span style="color: #339933;">-&gt;</span>null_bytes<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">/* &quot;row&quot; is the MySQL formatted row given by the core */</span>
<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>Field <span style="color: #339933;">**</span>field <span style="color: #339933;">=</span> table<span style="color: #339933;">-&gt;</span>field<span style="color: #339933;">;</span> <span style="color: #339933;">*</span>field<span style="color: #339933;">;</span> field<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">*</span>field<span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span>is_null<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
    pos <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">*</span>field<span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span>pack<span style="color: #009900;">&#40;</span>pos<span style="color: #339933;">,</span> row <span style="color: #339933;">+</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">*</span>field<span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span>offset<span style="color: #009900;">&#40;</span>row<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The above code snippet will populate &#8220;row_buffer&#8221; with the actual data that you want to write to your storage mechanism. You do not have to advance the &#8220;pos&#8221; pointer yourself because pack() returns a pointer to the end of what it wrote in the buffer (think Pascal Strings). This is precisely why we created the pos pointer: so that row_buffer itself is never moved and still points at the start of the packed row.</p>
<p>For the opposite situation (when retrieving a row), an unpack() function is provided for each field so you just need to take advantage of it like we did with the pack() snippet above.</p>
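<p>To make the idea more concrete, here is a toy model of packing and unpacking a single BLOB/TEXT column. This is only a sketch under my own assumptions and not the Drizzle/MySQL Field API: <code>toy_blob_column</code>, <code>toy_pack_blob()</code> and <code>toy_unpack_blob()</code> are hypothetical names, and the on-disk layout (a 4-byte native-endian length followed by the bytes) is chosen purely for illustration.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Toy stand-in for how a BLOB/TEXT column sits inside a row:
   a length plus a pointer to data that lives elsewhere. */
typedef struct {
  uint32_t length;
  const unsigned char *data;
} toy_blob_column;

/* Copies the length and then the pointed-to bytes into the buffer at
   "to". Returns the position just past what was written, mirroring
   how pos is advanced in the loop above. The caller must make sure
   the buffer has room for sizeof(uint32_t) + col-&gt;length bytes. */
unsigned char *toy_pack_blob(unsigned char *to, const toy_blob_column *col) {
  assert(to != NULL &amp;&amp; col != NULL);
  memcpy(to, &amp;col-&gt;length, sizeof(col-&gt;length));
  to += sizeof(col-&gt;length);
  if (col-&gt;length &gt; 0)
    memcpy(to, col-&gt;data, col-&gt;length);
  return to + col-&gt;length;
}

/* Reverse of toy_pack_blob(): reads the length and points the column
   at the bytes inside the packed buffer (no copy is made, so "from"
   must outlive the column). Returns the position just past the data. */
const unsigned char *toy_unpack_blob(toy_blob_column *col,
                                     const unsigned char *from) {
  assert(col != NULL &amp;&amp; from != NULL);
  memcpy(&amp;col-&gt;length, from, sizeof(col-&gt;length));
  from += sizeof(col-&gt;length);
  col-&gt;data = from;
  return from + col-&gt;length;
}</pre></div></div>

<p>Packing the &#8220;hello world&#8221; value from the INSERT above would write 4 bytes of length followed by 11 bytes of text, and unpacking the same buffer gives back a column that points at those 11 bytes.</p>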
<h4>Little bit more on fields</h4>
<p>The actual pack() function that gets called depends on the type of column since the Field class is an abstract base class for the subclasses that actually represent column types inside Drizzle/MySQL. If you want to know what a pack() function looks like for a BLOB type, grep for &#8220;Field_blob&#8221; in the source tree and there will be a pack() member function for it.</p>
<p>The code layout for the field subsystem in MySQL is rather difficult to comprehend since everything is crammed into the &#8220;sql/field.cc&#8221; and &#8220;sql/field.h&#8221; files (at least as of 5.4). So, if you want to get a good grasp of how things are architected, you should take a look at Drizzle. Field subclasses are located individually in the &#8220;drizzled/field/&#8221; directory and the base class is located in &#8220;drizzled/field.h&#8221;.</p>
<p>So, that&#8217;s about it! Hopefully this information will help other engine developers when they come across a need to support variable width tables :)</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/06/supporting-variable-width-tables/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tokyo Cabinet Tip: Protected Database Iteration</title>
		<link>http://torum.net/2009/05/tokyo-cabinet-protected-database-iteration/</link>
		<comments>http://torum.net/2009/05/tokyo-cabinet-protected-database-iteration/#comments</comments>
		<pubDate>Wed, 13 May 2009 06:29:17 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[knowledge]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[tip]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://torum.net/?p=1688</guid>
		<description><![CDATA[Tokyo Cabinet (TC) provides iteration functionality for both its persistent and non-persistent data structures. For example, if you wanted to iterate through TC&#8217;s hash database, you can use the tchdbiternext() function. This is really straightforward to use such that: void *key; int key_len; &#160; if &#40;tchdbiterinit&#40;tc_database_handle&#41; != true&#41; &#123; /* failed to initialize iterator [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://tokyocabinet.sourceforge.net">Tokyo Cabinet</a> (TC) provides iteration functionality for both its persistent and non-persistent data structures. For example, if you want to iterate through TC&#8217;s hash database, you can use the tchdbiternext() function. This is really straightforward to use, such that:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">void</span> <span style="color: #339933;">*</span>key<span style="color: #339933;">;</span>
<span style="color: #993333;">int</span> key_len<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>tchdbiterinit<span style="color: #009900;">&#40;</span>tc_database_handle<span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #000000; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* failed to initialize iterator */</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>key <span style="color: #339933;">=</span> tchdbiternext<span style="color: #009900;">&#40;</span>tc_database_handle<span style="color: #339933;">,</span> <span style="color: #339933;">&amp;</span>key_len<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> NULL<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* work with the fetched key and key_len */</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>will iterate through the entire hash database that the &#8220;tc_database_handle&#8221; object is responsible for. This can be handy if you need to loop through your database for some arbitrary reason.</p>
<p>However, there is a catch in using this function in a concurrent environment with a use case where the order of records <em>really</em> matters. This is because even though TC is a thread-safe library, the iteration functions aren&#8217;t thread-safe in the way that we might expect.</p>
<p>For example, if a write operation occurs while the application iterates over the database, you will end up iterating over a database that is in a changed state. This will not make the cursor go crazy and crash your application since TC handles this internally but you still end up iterating over a database that is in a state that you did not initially intend on looping through.</p>
<p>The solution to this is to simply block write operations to the database while your application iterates through it. For example, you could use pthread&#8217;s <a href="http://en.wikipedia.org/wiki/Readers-writer_lock">readers-writer lock</a> to allow other threads to read while you iterate but block writes until you finish iterating.</p>
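<p>A minimal sketch of that locking scheme, assuming POSIX threads, might look like the following. The <code>guarded_db</code> wrapper and its functions are hypothetical names of my own, and a plain counter stands in for the real database so that the example stays self-contained.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">#include &lt;assert.h&gt;
#include &lt;pthread.h&gt;
#include &lt;stddef.h&gt;

/* Hypothetical wrapper: a lock guarding some database handle. */
typedef struct {
  pthread_rwlock_t lock;
  int nrecords; /* stand-in for the real database */
} guarded_db;

/* Returns 0 on success, like the pthread calls it wraps. */
int guarded_db_init(guarded_db *db) {
  assert(db != NULL);
  db-&gt;nrecords = 0;
  return pthread_rwlock_init(&amp;db-&gt;lock, NULL);
}

/* Writers take the exclusive lock, so they wait for as long as
   a scan is holding the shared lock. */
int guarded_db_put(guarded_db *db) {
  int rc = pthread_rwlock_wrlock(&amp;db-&gt;lock);
  if (rc != 0)
    return rc;
  db-&gt;nrecords++;
  return pthread_rwlock_unlock(&amp;db-&gt;lock);
}

/* The scan takes the shared lock: other readers may proceed but
   writers are blocked until the iteration has finished. */
int guarded_db_scan(guarded_db *db, int *visited) {
  int rc = pthread_rwlock_rdlock(&amp;db-&gt;lock);
  if (rc != 0)
    return rc;
  *visited = 0;
  for (int i = 0; i &lt; db-&gt;nrecords; i++)
    (*visited)++; /* a real scan would call tchdbiternext() here */
  return pthread_rwlock_unlock(&amp;db-&gt;lock);
}</pre></div></div>

<p>The cost of this approach is that every write path in the application has to go through the same lock, which is easy to forget as the code grows.</p>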
<p>I was planning on doing this for a table scanner in the <a href="https://launchpad.net/blitzdb">storage engine</a> that I&#8217;m currently working on, but it turns out TC has an undocumented function that will take care of this internally. I&#8217;ve talked to Mikio about this function and apparently it is intentional that he hasn&#8217;t documented it on his <a href="http://tokyocabinet.sourceforge.net/spex-en.html">specification page</a>. He has no plans on throwing it out so you do not have to worry about it magically disappearing one day. For more information, you can take a look at his header file (tchdb.h for hash database).</p>
<h4>Explanation and Simple Example</h4>
<p>The function is called <strong>tchdbforeach()</strong>, which atomically iterates through your database from beginning to end, supplying each key/value pair to the callback function that you provide. The signature of the callback is the following:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">bool callback<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>kbuf<span style="color: #339933;">,</span> <span style="color: #993333;">int</span> ksiz<span style="color: #339933;">,</span> <span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>vbuf<span style="color: #339933;">,</span>
              <span style="color: #993333;">int</span> vsiz<span style="color: #339933;">,</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>op<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>where the fifth argument, &#8220;void *op&#8221; is an opaque pointer to the data that you can pass to the callback. Here is a simple example that will increment a counter integer on each iteration using this function:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">/* Do whatever you like with the provided key/value pair in here */</span>
bool callback<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>kbuf<span style="color: #339933;">,</span> <span style="color: #993333;">int</span> ksiz<span style="color: #339933;">,</span> <span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>vbuf<span style="color: #339933;">,</span>
              <span style="color: #993333;">int</span> vsiz<span style="color: #339933;">,</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>op<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>op <span style="color: #339933;">==</span> NULL<span style="color: #009900;">&#41;</span>
    <span style="color: #b1b100;">return</span> <span style="color: #000000; font-weight: bold;">false</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> <span style="color: #339933;">*</span><span style="color: #009900;">&#41;</span>op<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+=</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #b1b100;">return</span> <span style="color: #000000; font-weight: bold;">true</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #993333;">int</span> main<span style="color: #009900;">&#40;</span><span style="color: #993333;">void</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #993333;">int</span> niter <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
&nbsp;
  ...
&nbsp;
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>tchdbforeach<span style="color: #009900;">&#40;</span>tc_database_handle<span style="color: #339933;">,</span> callback<span style="color: #339933;">,</span> <span style="color: #339933;">&amp;</span>niter<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    fprintf<span style="color: #009900;">&#40;</span>stderr<span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;failed to iterate the database<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">return</span> EXIT_FAILURE<span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000066;">printf</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;iterated %d times<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">,</span> niter<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  ...
&nbsp;
  <span style="color: #b1b100;">return</span> EXIT_SUCCESS<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>If all goes well, the counter variable will be set to the number of records in the database. This function is slightly more complex than using tchdbiternext() but you are guaranteed to iterate atomically which is pretty important for a table scanner.</p>
<p>I hope this function can help you too.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/05/tokyo-cabinet-protected-database-iteration/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Journal of Storage Engine Development on Drizzle</title>
		<link>http://torum.net/2009/05/storage-engine-development-on-drizzle/</link>
		<comments>http://torum.net/2009/05/storage-engine-development-on-drizzle/#comments</comments>
		<pubDate>Tue, 12 May 2009 08:04:01 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[knowledge]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[engine]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=1581</guid>
		<description><![CDATA[I&#8217;ve decided to start a series of blog entries on not-so-obvious findings that I&#8217;ve found while working on my new project. By archiving the findings, I&#8217;m hoping that I can help those that are looking into developing a storage engine for the MySQL family in the future. Accumulating these mini-knowledge would also be useful for [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve decided to start a series of blog entries on not-so-obvious things that I&#8217;ve discovered while working on my <a href="https://launchpad.net/blitzdb">new project</a>. By archiving the findings, I&#8217;m hoping that I can help those that are looking into developing a storage engine for the MySQL family in the future.</p>
<p>Accumulating these bits of knowledge will also be useful for me since I can refer back to them when I forget something. Also, once I write enough entries I&#8217;m planning on summarizing them and making the summary available on the <a href="http://drizzle.org/wiki/">Drizzle Wiki</a>. If <a href="http://www.mysql.com">MySQL</a> is interested in updating the engine documentation, I would be more than happy to help there too.</p>
<p>So to begin with, I&#8217;ll describe something trivial that I stumbled across while trying to catch an error on duplicate primary key insertion into the data table.</p>
<h4>Background</h4>
<p>In brief, the database kernel does not care if the INSERT query contains a duplicate primary key for a given table or not. It is the storage engine&#8217;s job to tell the kernel that the request was invalid due to key collision. If a storage engine fails to do this, the kernel will acknowledge that the query was successful (given that no other errors were thrown) and will keep doing what it needs to do.</p>
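<p>To make that contract concrete, here is a minimal sketch of a toy engine and a stand-in kernel. Everything prefixed with <em>toy_</em> is hypothetical and exists only for illustration; the one borrowed detail is the status value 121, which mirrors HA_ERR_FOUND_DUPP_KEY in &#8220;drizzled/base.h&#8221;. The real kernel does far more than this, so treat it as a model of the idea rather than of Drizzle itself.</p>

```c
#include <assert.h>
#include <stddef.h>

/* 0 means success; 121 mirrors HA_ERR_FOUND_DUPP_KEY. */
#define TOY_OK 0
#define TOY_ERR_FOUND_DUPP_KEY 121
#define TOY_ERR_FULL 1 /* toy-only status, not a real handler code */
#define TOY_CAPACITY 16

/* Hypothetical table that stores integer primary keys only. */
struct toy_table {
  int keys[TOY_CAPACITY];
  size_t count;
};

/* Returns 1 if the key is already stored, 0 otherwise. */
int toy_contains(const struct toy_table *t, int key) {
  for (size_t i = 0; i < t->count; i++) {
    if (t->keys[i] == key) return 1;
  }
  return 0;
}

/* Well-behaved write_row(): tells the caller about the collision. */
int toy_write_row_checked(struct toy_table *t, int key) {
  if (toy_contains(t, key)) return TOY_ERR_FOUND_DUPP_KEY;
  if (t->count == TOY_CAPACITY) return TOY_ERR_FULL;
  t->keys[t->count++] = key;
  return TOY_OK;
}

/* Careless write_row(): skips the duplicate but still claims success. */
int toy_write_row_unchecked(struct toy_table *t, int key) {
  if (!toy_contains(t, key) && t->count < TOY_CAPACITY) {
    t->keys[t->count++] = key;
  }
  return TOY_OK;
}

/* Stand-in for the kernel: it only ever looks at the status code. */
int toy_kernel_accepts(int status) {
  return status == TOY_OK;
}
```

<p>With the unchecked variant, inserting the same key twice is accepted both times by toy_kernel_accepts(), which is exactly the silent failure described above.</p>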
<h4>Mechanics</h4>
<p>Data insertion is handled inside the <a href="http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine#Adding_Support_for_INSERT_to_a_Storage_Engine">write_row()</a> function that your engine must implement. The return value of this function is an integer that represents the status of the work it has done. After looking through the possible error statuses in &#8220;drizzled/base.h&#8221;, I immediately found this:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #339933;">#define HA_ERR_FOUND_DUPP_KEY 121 /* Dupplicate key on write */</span></pre></div></div>

<p>I also looked through <a href="http://en.wikipedia.org/wiki/MyISAM">MyISAM</a> and <a href="http://en.wikipedia.org/wiki/Innodb">InnoDB</a> to confirm that this was indeed the correct error status to return on duplicate primary key. Here is the snippet of my row insertion <strong>at the time</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">/* TC's tchdbputkeep will not insert a row to the table if there
   was a collision */</span>
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>tchdbputkeep<span style="color: #009900;">&#40;</span>data_table<span style="color: #339933;">,</span> primary_key<span style="color: #339933;">,</span> primary_key_length<span style="color: #339933;">,</span> buf<span style="color: #339933;">,</span>
                 table<span style="color: #339933;">-&gt;</span>s<span style="color: #339933;">-&gt;</span>reclength<span style="color: #009900;">&#41;</span> <span style="color: #339933;">==</span> <span style="color: #000000; font-weight: bold;">false</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  my_errno <span style="color: #339933;">=</span> HA_ERR_GENERIC<span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #808080; font-style: italic;">/* check for primary key collision */</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>tchdbecode<span style="color: #009900;">&#40;</span>data_table<span style="color: #009900;">&#41;</span> <span style="color: #339933;">==</span> TCEKEEP<span style="color: #009900;">&#41;</span>
    my_errno <span style="color: #339933;">=</span> HA_ERR_FOUND_DUPP_KEY<span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #b1b100;">return</span> my_errno<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>At first glance this seems right, but the error I was getting at the command line prompt always differed from the one MyISAM and InnoDB produced, despite returning the same error status. Specifically, this is what I was getting:</p>

<div class="wp_syntax"><div class="code"><pre class="null" style="font-family:monospace;">ERROR 1022 (23000): Can't write; duplicate key in table 't1'</pre></div></div>

<p>whereas I was getting this error on other engines:</p>

<div class="wp_syntax"><div class="code"><pre class="null" style="font-family:monospace;">ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'</pre></div></div>

<p>At this stage I couldn&#8217;t make sense of what I was doing wrong but it turned out that the solution was pretty simple.</p>
<h4>Solution</h4>
<p>After talking to <a href="http://www.flamingspork.com/blog/">Stewart Smith</a> about my issue in #drizzle @ freenode, it turned out that I am supposed to keep track of which key the duplication was found on in write_row() and report it to the kernel via the <a href="http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine#Implementing_the_info.28.29_Method">info()</a> function.</p>
<p>You can do this by setting the <em>errkey</em> integer variable to the key number that is used internally by the kernel. So, obtaining the internal primary key number with this call in write_row():</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">share<span style="color: #339933;">-&gt;</span>errkey <span style="color: #339933;">=</span> table<span style="color: #339933;">-&gt;</span>s<span style="color: #339933;">-&gt;</span>primary_key<span style="color: #339933;">;</span></pre></div></div>

<p>and adding the following code to info():</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>flag <span style="color: #339933;">&amp;</span> HA_STATUS_ERRKEY<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  errkey <span style="color: #339933;">=</span> share<span style="color: #339933;">-&gt;</span>errkey<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>happily fixed the issue I was experiencing. Yay.</p>
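<p>The whole round trip can be modelled in a few lines of plain C. This is only a sketch: the <em>sketch_</em> names, the struct layout, the numeric value of HA_STATUS_ERRKEY and the kernel-side logic are all my assumptions, chosen to reproduce the two error codes shown earlier (1022 when no key number is reported, 1062 when one is). It is not the actual Drizzle code.</p>

```c
#include <assert.h>

#define HA_ERR_FOUND_DUPP_KEY 121 /* value quoted from drizzled/base.h */
#define HA_STATUS_ERRKEY 32       /* assumed flag value for this sketch */
#define NO_ERRKEY (-1)            /* assumed "no key reported" marker */

/* Hypothetical, heavily simplified stand-in for a handler instance. */
struct sketch_handler {
  int primary_key;   /* key number the kernel uses for PRIMARY */
  int share_errkey;  /* remembered by write_row() on collision */
  int errkey;        /* what the kernel reads back after info() */
  int report_errkey; /* 0 reproduces the bug, 1 applies the fix */
};

/* Engine side: remember which key collided before returning the status. */
int sketch_write_row(struct sketch_handler *h, int is_duplicate) {
  if (!is_duplicate) return 0;
  h->share_errkey = h->primary_key;
  return HA_ERR_FOUND_DUPP_KEY;
}

/* Engine side: hand the remembered key number over when asked for it. */
void sketch_info(struct sketch_handler *h, unsigned flag) {
  if (h->report_errkey && (flag & HA_STATUS_ERRKEY)) {
    h->errkey = h->share_errkey;
  }
}

/* Kernel side (assumed behaviour): ask for the key, then pick the error. */
int sketch_error_code(struct sketch_handler *h, int status) {
  if (status != HA_ERR_FOUND_DUPP_KEY) return 0;
  h->errkey = NO_ERRKEY;
  sketch_info(h, HA_STATUS_ERRKEY);
  /* No key number means the generic message; otherwise the specific one. */
  return h->errkey == NO_ERRKEY ? 1022 : 1062;
}
```

<p>A handler with report_errkey set to 0 behaves like my original engine and ends up with the generic 1022, while setting it to 1 yields the 1062 that the other engines report.</p>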
<p>I guess the section on info() in the documentation does hint that this is where you supply the key number on a key error, but frankly, this is really easy to miss since its importance isn&#8217;t emphasized.</p>
<p>Anyhow, that&#8217;s all I have to say in the first entry of this series, and hopefully I&#8217;ll write something more interesting in the upcoming entries. Until then, happy hacking ;)</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/05/storage-engine-development-on-drizzle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

