<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Toru Maesaka &#187; tokyocabinet</title>
	<atom:link href="http://torum.net/tag/tokyocabinet/feed/" rel="self" type="application/rss+xml" />
	<link>http://torum.net</link>
	<description>Hackaholic and a Web Addict based in Tokyo</description>
	<lastBuildDate>Tue, 28 Feb 2012 10:52:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>How to Recover a Tokyo Cabinet Database</title>
		<link>http://torum.net/2010/01/how-to-recover-a-tokyo-cabinet-database-file/</link>
		<comments>http://torum.net/2010/01/how-to-recover-a-tokyo-cabinet-database-file/#comments</comments>
		<pubDate>Fri, 08 Jan 2010 07:36:47 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[oss]]></category>
		<category><![CDATA[recovery]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2324</guid>
		<description><![CDATA[Recently Mark Callaghan asked me whether BlitzDB is crash safe, since he was aware that Tokyo Cabinet isn&#8217;t crash safe (unless used with transactions). In Tokyo Cabinet and Tyrant&#8217;s defense, I should mention that this is intentional. The idea is to trade durability for higher throughput. The author&#8217;s philosophy is that data [...]]]></description>
			<content:encoded><![CDATA[<p>Recently <a href="http://mysqlha.blogspot.com/">Mark Callaghan</a> asked me whether BlitzDB is crash safe, since he was aware that <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> isn&#8217;t crash safe (unless used with transactions). In Tokyo Cabinet and Tyrant&#8217;s defense, I should mention that this is intentional. The idea is to trade durability for higher throughput. The author&#8217;s philosophy is that data availability should be secured by replication. This makes sense since the design of TC and TT is influenced by mixi&#8217;s high traffic (we need single instances to handle over 10k requests per sec).</p>
<p>So with that said, let&#8217;s move on to the main topic. The honest answer is that BlitzDB is not crash safe either (transaction support is still a long way off). If the admin is lucky, she will be able to repair the table(s) using the REPAIR TABLE syntax. BlitzDB&#8217;s crash safety strategy is the same as Tokyo Tyrant&#8217;s: you should use replication. The question is, how do you repair a broken Tokyo Cabinet file?</p>
<p>The answer is pretty simple and it&#8217;s documented in the Japanese TC documentation. Unfortunately it&#8217;s not present in the English documentation, so allow me to go through it with demo code in this post. There are two ways to attempt to recover a Tokyo Cabinet database:</p>
<ol>
<li>By using the Tokyo Cabinet API.</li>
<li>By using Tokyo Cabinet&#8217;s command line tool.</li>
</ol>
<p>Let&#8217;s first go through how to confirm that your database is broken. I&#8217;ll also cover how to interpret the errors.</p>
<h3>How to confirm that your Database is broken</h3>
<p>Simply use the command line tools installed with Tokyo Cabinet. Look at the &#8220;additional flags&#8221; line in the output of &#8220;tchmgr inform&#8221; or &#8220;tcbmgr inform&#8221;, depending on your database type. If it says &#8220;fatal&#8221;, then your file is really broken. If it says &#8220;open&#8221;, it means that your application died or exited without closing the database. A file in the &#8220;open&#8221; state is still usable but your most recent records are most likely unavailable. This is because TC connects the hash chain only after it has confirmed that a write operation was successful. If your application died before the record was chained, then the record is not accessible in the database.</p>
<p>Furthermore, records that weren&#8217;t sync&#8217;d to disk by the kernel won&#8217;t be present after a power failure. If the disaster was a process failure, then the written data will hopefully be in the kernel&#8217;s write buffer so you won&#8217;t lose that data. For pedantic people, TC provides a way to sync the database from your application (tchdbsync() and tcbdbsync()). Whether to call this function (and how often) is up to your application&#8217;s policy.</p>
<h3>Using the Tokyo Cabinet API</h3>
<p>(1) Open the database file with locking disabled. That is, supply HDBONOLCK or BDBONOLCK to the open function of the appropriate database type (TCHDB or TCBDB).</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">/* This is for TCHDB */</span>
TCHDB <span style="color: #339933;">*</span>hdb <span style="color: #339933;">=</span> tchdbnew<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>tchdbopen<span style="color: #009900;">&#40;</span>hdb<span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;/path/to/broken_file&quot;</span><span style="color: #339933;">,</span> HDBONOLCK <span style="color: #339933;">|</span> HDBOWRITER<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* Failed to open. Do the appropriate thing. */</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">/* This is for TCBDB */</span>
TCBDB <span style="color: #339933;">*</span>btree <span style="color: #339933;">=</span> tcbdbnew<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>tcbdbopen<span style="color: #009900;">&#40;</span>btree<span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;/path/to/broken_file&quot;</span><span style="color: #339933;">,</span> BDBONOLCK <span style="color: #339933;">|</span> BDBOWRITER<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* Failed to open. Do the appropriate thing. */</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>(2) Run tchdboptimize() or tcbdboptimize() depending on the database type. You might wonder what you should give as the parameters for the optimize function. Conveniently, TC stores the tuning parameters of the database when you first opened it, so you can just provide -1 for every argument except the final one. This is because the final argument is an unsigned integer (uint8_t). What you want to provide for it instead is UINT8_MAX.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">/* This is for TCHDB */</span>
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>tchdboptimize<span style="color: #009900;">&#40;</span>hdb<span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> UINT8_MAX<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* We're out of luck. This hash database can't be rescued. */</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">/* This is for TCBDB */</span>
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>tcbdboptimize<span style="color: #009900;">&#40;</span>btree<span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> UINT8_MAX<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* We're out of luck. This b+tree database can't be rescued. */</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>If you&#8217;re lucky, the above would repair the database that is associated with TC&#8217;s database object.</p>
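<p>A side note on the UINT8_MAX argument in step (2): in C, an int value of -1 passed where a uint8_t is expected is converted modulo 256 and arrives as 255, which is the same value as UINT8_MAX. Writing UINT8_MAX simply states the intent and keeps sign-conversion warnings quiet. The following is a minimal standalone sketch of that conversion. It does not call TC at all, and takes_opts() and check_opts_conversion() are hypothetical names made up for this illustration.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a function whose last parameter is a uint8_t,
   like the final argument of the optimize functions. Not part of TC. */
static uint8_t takes_opts(uint8_t opts) {
  return opts;
}

/* Shows that an int holding -1 arrives as UINT8_MAX (255). */
static void check_opts_conversion(void) {
  int minus_one = -1;
  /* The argument is converted to uint8_t modulo 256, so -1 becomes 255. */
  assert(takes_opts(minus_one) == UINT8_MAX);
}
```

<p>Calling check_opts_conversion() from any test program should pass silently on a platform that provides uint8_t.</p>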
<h3>Using TC&#8217;s command line tool</h3>
<p>This approach is aimed more at database admins since I&#8217;m sure the last thing they want to do is write their own program to get their work done. Laziness is good.</p>
<p>TC provides utility programs called tchmgr (for a hash database) and tcbmgr (for a b+tree database) which allow you to run optimize on a database file. So if you wanted to repair a TC hash database, you would do the following:</p>

<div class="wp_syntax"><div class="code"><pre class="null" style="font-family:monospace;">$ tchmgr optimize -nl /path/to/broken_file</pre></div></div>

<p>and the following for the B+Tree Database:</p>

<div class="wp_syntax"><div class="code"><pre class="null" style="font-family:monospace;">$ tcbmgr optimize -nl /path/to/broken_file</pre></div></div>

<p>For those that are interested, the &#8220;-nl&#8221; option means &#8220;No Lock&#8221;, which is required to repair a database file.</p>
<p>Well, I guess this sums up this blog post. I hope this post will help you administer Tokyo Tyrant and/or your Tokyo Cabinet based application!</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/01/how-to-recover-a-tokyo-cabinet-database-file/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Congrats to Kyoto Cabinet&#8217;s Alpha Release</title>
		<link>http://torum.net/2010/01/kyotocabinet-alpha-release/</link>
		<comments>http://torum.net/2010/01/kyotocabinet-alpha-release/#comments</comments>
		<pubDate>Fri, 01 Jan 2010 16:36:30 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[oss]]></category>
		<category><![CDATA[kyotocabinet]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2322</guid>
		<description><![CDATA[Writing this blog entry to congratulate Mikio for releasing Kyoto Cabinet (alpha release), which is positioned as a successor project of Tokyo Cabinet. From a development perspective, the big difference is that Kyoto Cabinet is implemented in pure C++03 whereas TC is pure C99. AFAIK, C++03 was adopted so that KC can run on broader platforms [...]]]></description>
			<content:encoded><![CDATA[<p>Writing this blog entry to congratulate <a href="http://1978th.net">Mikio</a> for releasing <a href="http://1978th.net/kyotocabinet/">Kyoto Cabinet</a> (alpha release), which is positioned as a successor project of Tokyo Cabinet.</p>
<p>From a development perspective, the big difference is that Kyoto Cabinet is implemented in pure C++03 whereas TC is pure C99. AFAIK, C++03 was adopted so that KC can run on a broader range of platforms than just POSIX-oriented systems (so, theoretically, it can build on Windows). From a project perspective, KC is licensed under GPLv3 whereas TC is licensed under LGPL.</p>
<p>For my projects, I currently have no interest in moving to KC from TC since I&#8217;m convinced that TC is better suited for my target platforms (UNIX/Linux). Another reason (VERY personal reason) is that I like C99 more than C++. Although&#8230; I don&#8217;t have the right to say this since I&#8217;m far from proficient at C++. For more details on the project differences, you should take a look at KC&#8217;s project page that Mikio had prepared:</p>
<ul>
<li><a href="http://1978th.net/kyotocabinet/">http://1978th.net/kyotocabinet/</a></li>
</ul>
<p>For those that are interested in KC&#8217;s performance, I&#8217;ll do some benchmarks against TC when I get back from my end of year vacation. Hope everyone is enjoying New Years!</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/01/kyotocabinet-alpha-release/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BlitzDB and Tokyo Cabinet Concurrency Model</title>
		<link>http://torum.net/2009/11/blitzdb-and-tc-concurrency-model/</link>
		<comments>http://torum.net/2009/11/blitzdb-and-tc-concurrency-model/#comments</comments>
		<pubDate>Thu, 19 Nov 2009 13:29:35 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[concurrency]]></category>
		<category><![CDATA[locking]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[pthread]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2307</guid>
		<description><![CDATA[Yesterday I sat in front of a whiteboard for a few hours with Mikio, the author of Tokyo Cabinet, discussing/debating what the optimal concurrency model would be for BlitzDB. I think we came to a pretty good conclusion so I&#8217;m going to note it on this entry. But before I step any further, allow me to [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I sat in front of a whiteboard for a few hours with Mikio, the author of Tokyo Cabinet, discussing/debating what the optimal concurrency model would be for BlitzDB. I think we came to a pretty good conclusion so I&#8217;m going to note it in this entry. But before I go any further, allow me to go over Tokyo Cabinet&#8217;s concurrency model.</p>
<h3>Tokyo Cabinet&#8217;s Concurrency Model</h3>
<p>Tokyo Cabinet is <a href="http://www.google.com/search?rls=en&#038;q=tokyo+cabinet+single+writer">often quoted</a> as &#8220;single writer, multi reader&#8221; but this is <strong>not quite</strong> true. At the time of this blog entry, this statement holds true for TC&#8217;s B+Tree database but TC&#8217;s hash database can actually allow multiple writers to update and/or delete records concurrently.</p>
<p>If you look at the entry point of tchdbput(), you will notice that it actually obtains a reader&#8217;s lock (in terms of <a href="http://en.wikipedia.org/wiki/Readers-writer_lock">rwlock</a>). TCHDB then hashes the provided key and obtains the index of the bucket to which the record of interest belongs. Given the bucket/block to work on, TC then looks at the 8 most significant bits of the hash value and attempts to obtain a granular update lock from an array of 256 mutexes (2 ^ 8 = 256). So, things are still concurrent at this stage, though there is <em>some</em> chance of a collision that would block a thread.</p>
<p>If a record already exists, TC will go on and happily update that block, but if the record is new (as in the key doesn&#8217;t exist), TC will lock the tail block of the database and write the new record there. So, only writing a new record is treated as a single writer and the rest can be processed concurrently. This is why I said it&#8217;s <strong>not quite</strong> true.</p>
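<p>To make the slot selection described above more concrete, here is a minimal sketch of the general idea (often called lock striping). This is not TC&#8217;s actual code. It assumes a 32-bit hash value, and striped_lock_t, slot_index() and the other names are hypothetical ones made up for this illustration.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS 256 /* 2 ^ 8 slots, one mutex each */

/* Hypothetical striped-lock table; not TC's internal structure. */
typedef struct {
  pthread_mutex_t slots[NUM_SLOTS];
} striped_lock_t;

/* Initializes every slot. Returns 0 on success, -1 on failure. */
static int striped_lock_init(striped_lock_t *sl) {
  for (int i = 0; i < NUM_SLOTS; i++) {
    if (pthread_mutex_init(&sl->slots[i], NULL) != 0) return -1;
  }
  return 0;
}

/* Picks a slot from the 8 most significant bits of a 32-bit hash. */
static unsigned slot_index(uint32_t hash) {
  return hash >> 24;
}

/* Two writers only block each other when their hashes share a slot. */
static void striped_lock(striped_lock_t *sl, uint32_t hash) {
  assert(sl != NULL);
  pthread_mutex_lock(&sl->slots[slot_index(hash)]);
}

static void striped_unlock(striped_lock_t *sl, uint32_t hash) {
  assert(sl != NULL);
  pthread_mutex_unlock(&sl->slots[slot_index(hash)]);
}
```

<p>A writer would call striped_lock() with the hash of its key before touching the record and striped_unlock() afterwards. With 256 slots, two threads working on unrelated keys will usually land on different mutexes.</p>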
<h3>BlitzDB&#8217;s Concurrency Model</h3>
<p>With the above in mind, this is what BlitzDB&#8217;s concurrency model looks like:</p>
<ol>
<li>SELECT queries can run concurrently.</li>
<li>SELECT queries are blocked when UPDATE and/or REPLACE queries are being processed.</li>
<li>UPDATE, REPLACE, DELETE queries can run concurrently.</li>
<li>INSERT is never disrupted by BlitzDB and is scheduled by TC.</li>
</ol>
<p>In an ideal world, I would allow Drizzle&#8217;s worker threads to _directly_ interact with TC and let TC handle thread synchronization. This would make my life fantastically easy but unfortunately life isn&#8217;t so easy.</p>
<p>For example, if a record is deleted while BlitzDB&#8217;s table scan is occurring, the table scanner will stop scanning at the position where the deleted key existed. I would not have this problem if I used TC&#8217;s native iterator but my table scan implementation uses TC&#8217;s <a href="http://torum.net/2009/10/iterating-tokyo-cabinet-in-parallel/">hidden API</a> that won&#8217;t babysit me in this regard. In return I can gain maximum concurrent read throughput from TC which was a tradeoff I happily accepted.</p>
<p>So, there are several little gotchas like this which force me to implement concurrency control in BlitzDB. Here&#8217;s how I&#8217;m planning on doing it (with demo code!).</p>
<h3>Implementation (with demo code)</h3>
<p>In the past I&#8217;ve gone through several experimental stages with BlitzDB where I used pthread&#8217;s rwlock to control concurrency. The short answer is, &#8220;IT WORKS!&#8221;. However, it was not taking full advantage of TC&#8217;s concurrency model.</p>
<p>For example I did not want to protect UPDATE queries with a writer&#8217;s lock since it would block other UPDATE/DELETE queries. So why not protect it with a reader&#8217;s lock? The issue here is that any query that can change the state of the table cannot be processed while a scanner is running (which btw is protected by a reader&#8217;s lock). Furthermore, a non-index based update/delete means that the scanner _is_ running so there&#8217;s a problem there too.</p>
<p>What I need is a scheduler that can allow multiple INSERT/UPDATE/REPLACE/DELETE queries to run when the scanner is not running. On the other hand, the scheduler must allow multiple scanners to run when UPDATE/REPLACE/DELETE queries aren&#8217;t being processed _BUT_ still let INSERT queries come through to TC.</p>
<p>Implementing the above is probably possible by using multiple mutexes but it would bring complexity to the codebase and possible deadlocks that can be difficult to debug. So we decided to learn from pthread&#8217;s rwlock implementation and write an original lock mechanism similar to rwlock but something that allows us to write our own rules for scheduling.</p>
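<p>As a rough illustration of the scheduling rules described above, the following is a minimal sketch built on one pthread mutex and one condition variable. It is not the contents of blitzlock.cc, and sched_lock_t, scan_begin(), update_begin() and so on are hypothetical names made up for this illustration. Scanners and updaters (UPDATE/REPLACE/DELETE) each run concurrently among themselves but never at the same time as the other group. INSERT is assumed to bypass this lock entirely.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical scheduler lock; names and layout are illustrative only.
   INSERT queries never take this lock and go straight to TC. */
typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  int scanners; /* table scans currently running */
  int updaters; /* UPDATE/REPLACE/DELETE queries currently running */
} sched_lock_t;

static void sched_lock_init(sched_lock_t *l) {
  pthread_mutex_init(&l->mutex, NULL);
  pthread_cond_init(&l->cond, NULL);
  l->scanners = 0;
  l->updaters = 0;
}

/* A scanner may start once no updater is running. */
static void scan_begin(sched_lock_t *l) {
  pthread_mutex_lock(&l->mutex);
  while (l->updaters > 0)
    pthread_cond_wait(&l->cond, &l->mutex);
  l->scanners++;
  pthread_mutex_unlock(&l->mutex);
}

static void scan_end(sched_lock_t *l) {
  pthread_mutex_lock(&l->mutex);
  assert(l->scanners > 0);
  if (--l->scanners == 0)
    pthread_cond_broadcast(&l->cond); /* wake any waiting updaters */
  pthread_mutex_unlock(&l->mutex);
}

/* An updater may start once no scanner is running. */
static void update_begin(sched_lock_t *l) {
  pthread_mutex_lock(&l->mutex);
  while (l->scanners > 0)
    pthread_cond_wait(&l->cond, &l->mutex);
  l->updaters++;
  pthread_mutex_unlock(&l->mutex);
}

static void update_end(sched_lock_t *l) {
  pthread_mutex_lock(&l->mutex);
  assert(l->updaters > 0);
  if (--l->updaters == 0)
    pthread_cond_broadcast(&l->cond); /* wake any waiting scanners */
  pthread_mutex_unlock(&l->mutex);
}
```

<p>This sketch deliberately leaves out two things: it gives neither group priority, so a steady stream of scanners could starve updaters (and vice versa), and it does not handle a non-index based update/delete that needs to scan and update within the same query.</p>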
<p>Here&#8217;s my first attempt at a standalone sandbox of the model:</p>
<ul>
<li><a href="http://torum.net/code/cc/blitzlock.cc">http://torum.net/code/cc/blitzlock.cc</a></li>
</ul>
<p>You can compile and run it to get a grasp of how threads are coordinated:</p>
<pre>$ wget http://torum.net/code/cc/blitzlock.cc
$ g++ -Wall -pedantic blitzlock.cc -lpthread &#038;&#038; ./a.out</pre>
<p>If you got the program running and are wondering what the output means, think of the &#8220;updater&#8221; as a thread that performs either UPDATE, REPLACE or DELETE.</p>
<p>There is much more that I&#8217;d love to go on about but I think I&#8217;ve bloated this entry enough so I will save my urge for another day :)</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/11/blitzdb-and-tc-concurrency-model/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Iterating Tokyo Cabinet in Parallel</title>
		<link>http://torum.net/2009/10/iterating-tokyo-cabinet-in-parallel/</link>
		<comments>http://torum.net/2009/10/iterating-tokyo-cabinet-in-parallel/#comments</comments>
		<pubDate>Wed, 21 Oct 2009 12:04:59 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[oss]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2296</guid>
		<description><![CDATA[Iterating a Tokyo Cabinet database (both B+Tree and Hash Table) is fairly easy as I&#8217;ve described it in the past. However, things are different if you want to allow multiple threads to iterate the database individually. This is because iterating TC in a &#8220;standard&#8221; way limits you to obtain only one iterator per database object. [...]]]></description>
			<content:encoded><![CDATA[<p>Iterating a Tokyo Cabinet database (both B+Tree and Hash Table) is fairly easy as I&#8217;ve <a href="http://torum.net/2009/05/tokyo-cabinet-protected-database-iteration/">described it in the past</a>. However, things are different if you want to allow multiple threads to iterate the database individually. This is because iterating TC in a &#8220;standard&#8221; way limits you to obtain only one iterator per database object.</p>
<h3>The Problem</h3>
<p>Iterating in a standard way means no matter how hard you try, only one thread can iterate the database at a time. This obviously kills concurrent performance and it is something you want to avoid at all costs. Why do I care? Well, this is a pain when developing a relational storage engine because it means that the table scanner can&#8217;t be run in parallel. In terms of MySQL/Drizzle internals, threads will have to wait until the scanning thread exits rnd_end(). Not good at all.</p>
<h3>The Solution</h3>
<p>Fortunately Mikio (the author of TC) was aware of this issue from the beginning and has provided a series of hidden functions that can solve this problem. He actually has other hidden functionality in TC that is only documented in the header file. He has no plans to delete those functions and mentioned that they are only for experts who can actually be bothered to read the header file.</p>
<p>The function we&#8217;re interested in is a key-based iterator called tchdbgetnext() and I will introduce my favorite of the series, <strong>tchdbgetnext3()</strong>. The idea behind tchdbgetnext() is this: given a particular key, TC will return the key/value pair of the next record. The database can then be iterated by repeatedly passing the returned key back to tchdbgetnext(). The first record in the database can be obtained by providing a NULL key.</p>
<p>You&#8217;ll see why Mikio decided to hide this function in the following example with tchdbgetnext3():</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span>fetched_data<span style="color: #339933;">;</span>
<span style="color: #993333;">char</span> <span style="color: #339933;">*</span>current_key <span style="color: #339933;">=</span> NULL<span style="color: #339933;">;</span>
<span style="color: #993333;">char</span> <span style="color: #339933;">*</span>last_key <span style="color: #339933;">=</span> NULL<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #993333;">int</span> fetched_data_len <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
<span style="color: #993333;">int</span> current_key_len <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
<span style="color: #993333;">int</span> last_key_len <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  last_key <span style="color: #339933;">=</span> current_key<span style="color: #339933;">;</span>
  last_key_len <span style="color: #339933;">=</span> current_key_len<span style="color: #339933;">;</span>
&nbsp;
  current_key <span style="color: #339933;">=</span> tchdbgetnext3<span style="color: #009900;">&#40;</span>tokyo_hashdb_handle<span style="color: #339933;">,</span> last_key<span style="color: #339933;">,</span>
                              last_key_len<span style="color: #339933;">,</span> <span style="color: #339933;">&amp;</span>current_key_len<span style="color: #339933;">,</span>
                              <span style="color: #339933;">&amp;</span>fetched_data<span style="color: #339933;">,</span> <span style="color: #339933;">&amp;</span>fetched_data_len<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #808080; font-style: italic;">/* This will free the value as well. Explained in the blog entry.*/</span>
  free<span style="color: #009900;">&#40;</span>last_key<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #808080; font-style: italic;">/* The entire database has been iterated */</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>current_key <span style="color: #339933;">==</span> NULL<span style="color: #009900;">&#41;</span>
    <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>You might be wondering why the &#8220;last_key&#8221; pointer is needed above. The answer is that tchdbgetnext3() returns a newly allocated pointer to the next key, so reusing the same pointer without freeing it will result in a memory leak. To avoid this leak, &#8220;last_key&#8221; is used to remember where &#8220;current_key&#8221; pointed before it was re-pointed by tchdbgetnext3(). Confused? This takes a bit of thinking to understand, hence it is only documented in the TC header file.</p>
<p>Another tricky but nice thing about tchdbgetnext3() is that it only calls malloc(3) once for each key/value pair. Usually malloc(3) is called separately for the key and the value, but with tchdbgetnext3(), TC allocates a buffer with enough space to accommodate both and copies them next to each other. So, the key and value pointers that tchdbgetnext3() sets on success actually point into the same buffer. This is why I only call free(3) on the key pointer in the above example. It frees the entire buffer, which includes the value region.</p>
<p>Again, this takes a little bit of thinking to understand but it can cut the number of malloc(3) calls in half, which can mean a lot to some people.</p>
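<p>If the single-buffer trick is hard to picture, here is a minimal standalone sketch of the same allocation pattern. It does not use TC, and the exact layout inside TC&#8217;s buffer may differ. pack_pair() is a hypothetical helper made up for this illustration, and terminating both regions with a NUL byte is simply a choice made in this sketch.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not a TC function: copies the key and the value
   into one malloc'd buffer, each followed by a NUL byte. Returns the key
   pointer (the start of the buffer) and stores the value pointer in *vbp.
   The caller frees only the returned key pointer. */
static char *pack_pair(const char *kbuf, int ksiz,
                       const char *vbuf, int vsiz, const char **vbp) {
  assert(ksiz >= 0 && vsiz >= 0);
  char *buf = malloc((size_t)ksiz + (size_t)vsiz + 2);
  if (buf == NULL) return NULL;
  memcpy(buf, kbuf, (size_t)ksiz);
  buf[ksiz] = '\0';
  memcpy(buf + ksiz + 1, vbuf, (size_t)vsiz);
  buf[ksiz + 1 + vsiz] = '\0';
  *vbp = buf + ksiz + 1; /* value lives right after the key */
  return buf;
}
```

<p>Because the value pointer points into the middle of the key&#8217;s allocation, calling free(3) on it would be a bug. Only the key pointer may be freed, and doing so releases both regions at once.</p>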
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/10/iterating-tokyo-cabinet-in-parallel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Writing records with duplicate keys to Tokyo Cabinet</title>
		<link>http://torum.net/2009/09/writing-duplicate-keys-to-tokyocabinet/</link>
		<comments>http://torum.net/2009/09/writing-duplicate-keys-to-tokyocabinet/#comments</comments>
		<pubDate>Wed, 09 Sep 2009 14:56:08 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[knowledge]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[btree]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2287</guid>
		<description><![CDATA[Lately I&#8217;ve been noticing that people are visiting my blog to find ways to write multiple records with the same key to a Tokyo Cabinet (TC) database. Well, the answer depends on which data structure you choose to construct a TC database. If you&#8217;re interested in TC&#8217;s hash database then you&#8217;re out of luck but [...]]]></description>
			<content:encoded><![CDATA[<p>Lately I&#8217;ve been noticing that people are visiting my blog to find ways to write multiple records with the same key to a <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> (TC) database.</p>
<p>Well, the answer depends on which data structure you choose to construct a TC database. If you&#8217;re interested in TC&#8217;s hash database then you&#8217;re out of luck but <a href="http://1978th.net/tokyocabinet/spex-en.html#tcbdbapi">TC&#8217;s B+Tree database</a> will allow you to write duplicate keys. If you just want the answer, here&#8217;s a <a href="http://torum.net/code/c/tc_dupkey.c">compilable source</a> of how to do it. For those that are interested in how it works, keep on reading :)</p>
<p>So here&#8217;s how it&#8217;s done. You write the record(s) using TC&#8217;s tcbdbputdup() function so that upon key collision, TC will write the record next to the existing one. The following snippet will write three records to Tokyo Cabinet using an identical key.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span>key <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;key&quot;</span><span style="color: #339933;">;</span>
<span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span>r1 <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;record 1&quot;</span><span style="color: #339933;">;</span>
<span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span>r2 <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;record 2&quot;</span><span style="color: #339933;">;</span>
<span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span>r3 <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;record 3&quot;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">/* store three different records with the same key.
   note that &quot;database_handle&quot; is a TCBDB object. */</span>
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>tcbdbputdup<span style="color: #009900;">&#40;</span>database_handle<span style="color: #339933;">,</span> key<span style="color: #339933;">,</span> <span style="color: #0000dd;">3</span><span style="color: #339933;">,</span> r1<span style="color: #339933;">,</span> strlen<span style="color: #009900;">&#40;</span>r1<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">||</span>
    <span style="color: #339933;">!</span>tcbdbputdup<span style="color: #009900;">&#40;</span>database_handle<span style="color: #339933;">,</span> key<span style="color: #339933;">,</span> <span style="color: #0000dd;">3</span><span style="color: #339933;">,</span> r2<span style="color: #339933;">,</span> strlen<span style="color: #009900;">&#40;</span>r2<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">||</span>
    <span style="color: #339933;">!</span>tcbdbputdup<span style="color: #009900;">&#40;</span>database_handle<span style="color: #339933;">,</span> key<span style="color: #339933;">,</span> <span style="color: #0000dd;">3</span><span style="color: #339933;">,</span> r3<span style="color: #339933;">,</span> strlen<span style="color: #009900;">&#40;</span>r3<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  fprintf<span style="color: #009900;">&#40;</span>stderr<span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;failed to store data<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>tcbdbclose<span style="color: #009900;">&#40;</span>database_handle<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    fprintf<span style="color: #009900;">&#40;</span>stderr<span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;failed to close the database<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">return</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>   
  tcbdbdel<span style="color: #009900;">&#40;</span>database_handle<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #b1b100;">return</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Something to watch out for here is that, because duplication is allowed, every run of the above code appends the three records to the database again.</p>
<p>The next question is: how do we retrieve _only_ the records that correspond to the key we just inserted? Simple! Just traverse the tree from the first occurrence of the key and keep retrieving the data until we hit a different key.</p>
<p>The first thing to do is to create a cursor and move it to the first occurrence of the key.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">BDBCUR <span style="color: #339933;">*</span>cursor<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>cursor <span style="color: #339933;">=</span> tcbdbcurnew<span style="color: #009900;">&#40;</span>database_handle<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">==</span> NULL<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* FAIL. do the right thing for your application */</span> 
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">/* move the cursor to the first occurrence of the key */</span>
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>tcbdbcurjump<span style="color: #009900;">&#40;</span>cursor<span style="color: #339933;">,</span> key<span style="color: #339933;">,</span> strlen<span style="color: #009900;">&#40;</span>key<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* FAIL. do the right thing for your application */</span> 
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Now we&#8217;re ready to traverse the tree. Remember that we&#8217;re only interested in a certain key, so we only want to traverse the tree until we hit a different key. The following code snippet does exactly that and prints each record it discovers along the way. So in our case it would print &#8220;record 1&#8221;, &#8220;record 2&#8221; and &#8220;record 3&#8221;.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">char</span> <span style="color: #339933;">*</span>fetched_key<span style="color: #339933;">;</span>
<span style="color: #993333;">char</span> <span style="color: #339933;">*</span>fetched_value<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">/* traverse the tree. terminates if the entire tree is
   traversed _OR_ if it hits a different key */</span>
<span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>fetched_key <span style="color: #339933;">=</span> tcbdbcurkey2<span style="color: #009900;">&#40;</span>cursor<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> NULL<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* different key so break out of the loop */</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>strcmp<span style="color: #009900;">&#40;</span>key<span style="color: #339933;">,</span> fetched_key<span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    free<span style="color: #009900;">&#40;</span>fetched_key<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #808080; font-style: italic;">/* the key is allocated by TC on every call, so release it once compared */</span>
  free<span style="color: #009900;">&#40;</span>fetched_key<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  fetched_value <span style="color: #339933;">=</span> tcbdbcurval2<span style="color: #009900;">&#40;</span>cursor<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>fetched_value<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    fprintf<span style="color: #009900;">&#40;</span>stdout<span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;fetched: %s<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">,</span> fetched_value<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    free<span style="color: #009900;">&#40;</span>fetched_value<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
  tcbdbcurnext<span style="color: #009900;">&#40;</span>cursor<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The above tree traversal requires one additional lookup to terminate (if the entire tree isn&#8217;t traversed), but chances are that the records are stored in the same page, so this additional operation is cheap.</p>
<p>Alternatively, TC provides a function called tcbdbget4() which returns an allocated list (a TCLIST object) of the records that correspond to the key you provide. If you decide to take this approach, you should consider whether the cost of allocating memory and constructing the list is acceptable for your application.</p>
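<p>To get a feel for the trade-off between the two approaches, here is a minimal stand-alone sketch that does not use TC at all. It assumes a sorted in-memory array as a stand-in for the B+tree, and the names pair, first_occurrence(), scan_dups() and collect_dups() are all hypothetical. The first style visits the duplicates in place, much like the cursor loop, while the second allocates a list up front, which is roughly the kind of cost to weigh before reaching for a list-returning call.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* hypothetical stand-in for a record in a B+tree leaf: keys are sorted,
   so duplicates sit next to each other */
typedef struct {
  const char *key;
  const char *value;
} pair;

/* index of the first record whose key is &gt;= the given key (n if none) */
static size_t first_occurrence(const pair *recs, size_t n, const char *key) {
  size_t lo = 0, hi = n;
  while (lo &lt; hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (strcmp(recs[mid].key, key) &lt; 0)
      lo = mid + 1;
    else
      hi = mid;
  }
  return lo;
}

/* cursor style: visit each duplicate in place without allocating.
   it stops on the first different key, which costs one extra comparison */
static size_t scan_dups(const pair *recs, size_t n, const char *key,
                        void (*visit)(const char *value, void *op), void *op) {
  size_t i = first_occurrence(recs, n, key);
  size_t count = 0;
  for (; i &lt; n &amp;&amp; strcmp(recs[i].key, key) == 0; i++, count++) {
    if (visit != NULL)
      visit(recs[i].value, op);
  }
  return count;
}

/* list style: allocate an array of values up front. the caller must
   free() the returned array. returns NULL when nothing matches or when
   malloc fails */
static const char **collect_dups(const pair *recs, size_t n, const char *key,
                                 size_t *num) {
  size_t first = first_occurrence(recs, n, key);
  size_t count = scan_dups(recs, n, key, NULL, NULL);
  const char **values;
  size_t i;

  *num = 0;
  if (count == 0 || (values = malloc(count * sizeof(*values))) == NULL)
    return NULL;

  for (i = 0; i &lt; count; i++)
    values[i] = recs[first + i].value;

  *num = count;
  return values;
}</pre></div></div>

<p>If the records are only going to be printed or counted once, the in-place scan avoids the allocation entirely. The list style pays off when the caller needs to hold on to the results after the scan is over.</p>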
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/09/writing-duplicate-keys-to-tokyocabinet/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tokyo Cabinet Tip: Protected Database Iteration</title>
		<link>http://torum.net/2009/05/tokyo-cabinet-protected-database-iteration/</link>
		<comments>http://torum.net/2009/05/tokyo-cabinet-protected-database-iteration/#comments</comments>
		<pubDate>Wed, 13 May 2009 06:29:17 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[knowledge]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[tip]]></category>
		<category><![CDATA[tokyocabinet]]></category>

		<guid isPermaLink="false">http://torum.net/?p=1688</guid>
		<description><![CDATA[Tokyo Cabinet (TC) provides iteration functionality for both its persistent and non-persistent data structures. For example, if you want to iterate through TC&#8217;s hash database, you can use the tchdbiternext() function. This is really straightforward to use. The following snippet: void *key; int key_len; &#160; if &#40;tchdbiterinit&#40;tc_database_handle&#41; != true&#41; &#123; /* failed to initialize iterator [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://tokyocabinet.sourceforge.net">Tokyo Cabinet</a> (TC) provides iteration functionality for both its persistent and non-persistent data structures. For example, if you want to iterate through TC&#8217;s hash database, you can use the tchdbiternext() function. This is really straightforward to use. The following snippet:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">void</span> <span style="color: #339933;">*</span>key<span style="color: #339933;">;</span>
<span style="color: #993333;">int</span> key_len<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>tchdbiterinit<span style="color: #009900;">&#40;</span>tc_database_handle<span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #000000; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #808080; font-style: italic;">/* failed to initialize iterator */</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>key <span style="color: #339933;">=</span> tchdbiternext<span style="color: #009900;">&#40;</span>tc_database_handle<span style="color: #339933;">,</span> <span style="color: #339933;">&amp;</span>key_len<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> NULL<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
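  <span style="color: #808080; font-style: italic;">/* work with the fetched key and key_len, then release the key
     because TC allocates it with malloc */</span>
  free<span style="color: #009900;">&#40;</span>key<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>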
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>will iterate through the entire hash database that the &#8220;tc_database_handle&#8221; object is responsible for. This can be handy if you need to loop through your database for whatever reason.</p>
<p>However, there is a catch when using this function in a concurrent environment where the order of records _really_ matters. Even though TC is a thread-safe library, the iteration functions aren&#8217;t thread-safe in the way we might expect.</p>
<p>For example, if a write operation occurs while the application iterates over the database, you end up iterating over a database whose state has changed. This will not make the cursor go crazy and crash your application, since TC handles this internally, but you are no longer looping through the database in the state you originally intended.</p>
<p>The solution is simply to block write operations to the database while your application iterates through it. For example, you could use pthread&#8217;s <a href="http://en.wikipedia.org/wiki/Readers-writer_lock">rw_lock</a> to allow other threads to read while you iterate but block writes until you finish iterating.</p>
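<p>As a rough illustration of that idea, here is a minimal sketch built on POSIX read-write locks. It assumes a plain array as a stand-in for the database, and db_lock, guarded_put() and guarded_sum() are hypothetical names. In a real application the write path would wrap the TC store call and the scan would wrap the tchdbiternext() loop. Note that TC knows nothing about this lock, so it only helps if every writer goes through it.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">#include &lt;pthread.h&gt;
#include &lt;stddef.h&gt;

/* hypothetical application-level lock guarding one database handle */
static pthread_rwlock_t db_lock = PTHREAD_RWLOCK_INITIALIZER;

/* hypothetical stand-in for the database: a small fixed array */
static int records[8];
static size_t nrecords = 0;

/* writers take the exclusive lock, so they wait for any running scan.
   returns 1 if the value was stored and 0 if the array is full */
static int guarded_put(int value) {
  int stored = 0;
  pthread_rwlock_wrlock(&amp;db_lock);
  if (nrecords &lt; sizeof(records) / sizeof(records[0])) {
    records[nrecords++] = value;
    stored = 1;
  }
  pthread_rwlock_unlock(&amp;db_lock);
  return stored;
}

/* the scan holds the shared lock for the whole iteration: other readers
   may proceed but no writer can change the data underneath it */
static long guarded_sum(void) {
  long sum = 0;
  size_t i;
  pthread_rwlock_rdlock(&amp;db_lock);
  for (i = 0; i &lt; nrecords; i++)
    sum += records[i];
  pthread_rwlock_unlock(&amp;db_lock);
  return sum;
}</pre></div></div>

<p>The price is that a long scan stalls every writer for its whole duration, so this suits occasional full scans better than frequent ones.</p>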
<p>I was planning on doing this for a table scanner in the <a href="https://launchpad.net/blitzdb">storage engine</a> that I&#8217;m currently working on, but it turns out TC has an undocumented function that takes care of this internally. I&#8217;ve talked to Mikio about this function and apparently it is intentional that he hasn&#8217;t documented it on his <a href="http://tokyocabinet.sourceforge.net/spex-en.html">specification page</a>. He has no plans to throw it out, so you do not have to worry about it magically disappearing one day. For more information, you can take a look at his header file (tchdb.h for the hash database).</p>
<h4>Explanation and Simple Example</h4>
<p>The function is called <strong>tchdbforeach()</strong>. It atomically iterates through your database from beginning to end, supplying each key/value pair to the callback function that you provide. The signature of the callback is the following:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">bool callback<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>kbuf<span style="color: #339933;">,</span> <span style="color: #993333;">int</span> ksiz<span style="color: #339933;">,</span> <span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>vbuf<span style="color: #339933;">,</span>
              <span style="color: #993333;">int</span> vsiz<span style="color: #339933;">,</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>op<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>where the fifth argument, &#8220;void *op&#8221; is an opaque pointer to the data that you can pass to the callback. Here is a simple example that will increment a counter integer on each iteration using this function:</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">/* Do whatever you like with the provided key/value pair in here */</span>
bool callback<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>kbuf<span style="color: #339933;">,</span> <span style="color: #993333;">int</span> ksiz<span style="color: #339933;">,</span> <span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>vbuf<span style="color: #339933;">,</span>
              <span style="color: #993333;">int</span> vsiz<span style="color: #339933;">,</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>op<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>op <span style="color: #339933;">==</span> NULL<span style="color: #009900;">&#41;</span>
    <span style="color: #b1b100;">return</span> <span style="color: #000000; font-weight: bold;">false</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #993333;">int</span> <span style="color: #339933;">*</span><span style="color: #009900;">&#41;</span>op<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+=</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #b1b100;">return</span> <span style="color: #000000; font-weight: bold;">true</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #993333;">int</span> main<span style="color: #009900;">&#40;</span><span style="color: #993333;">void</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
  <span style="color: #993333;">int</span> niter <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
&nbsp;
  ...
&nbsp;
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>tchdbforeach<span style="color: #009900;">&#40;</span>tc_database_handle<span style="color: #339933;">,</span> callback<span style="color: #339933;">,</span> <span style="color: #339933;">&amp;</span>niter<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    fprintf<span style="color: #009900;">&#40;</span>stderr<span style="color: #339933;">,</span> <span style="color: #ff0000;">&quot;failed to iterate the database<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">return</span> EXIT_FAILURE<span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #000066;">printf</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;iterated %d times<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #339933;">,</span> niter<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
  ...
&nbsp;
  <span style="color: #b1b100;">return</span> EXIT_SUCCESS<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>If all goes well, the counter variable will be set to the number of records in the database. This function is slightly more complex than using tchdbiternext(), but you are guaranteed to iterate atomically, which is pretty important for a table scanner.</p>
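<p>The opaque pointer is not limited to a single integer. Here is a hedged sketch that passes a small struct through it and stops early by returning false, which, going by the comments in tchdb.h, should be how a callback asks the iteration to stop. To keep it stand-alone, toy_foreach() is a hypothetical stand-in that only mimics the callback contract and is not TC code. The scan_state struct and measure() are likewise made up for illustration.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* same shape as the callback signature shown earlier */
typedef bool (*iter_fn)(const void *kbuf, int ksiz, const void *vbuf,
                        int vsiz, void *op);

/* hypothetical state handed to the callback through the opaque pointer */
typedef struct {
  int nrecords;    /* records seen so far */
  long total_size; /* sum of key and value sizes in bytes */
  int limit;       /* stop once this many records have been seen */
} scan_state;

static bool measure(const void *kbuf, int ksiz, const void *vbuf,
                    int vsiz, void *op) {
  scan_state *state = op;
  (void)kbuf; /* only the sizes are used here */
  (void)vbuf;

  if (state == NULL)
    return false;

  state-&gt;nrecords += 1;
  state-&gt;total_size += ksiz + vsiz;

  /* returning false asks the iterator to stop early */
  return state-&gt;nrecords &lt; state-&gt;limit;
}

/* toy stand-in for the iterator (NOT TC code): walks parallel arrays of
   strings and stops as soon as the callback returns false */
static bool toy_foreach(const char **keys, const char **vals, size_t n,
                        iter_fn iter, void *op) {
  size_t i;
  for (i = 0; i &lt; n; i++) {
    int ksiz = (int)strlen(keys[i]);
    int vsiz = (int)strlen(vals[i]);
    if (!iter(keys[i], ksiz, vals[i], vsiz, op))
      break;
  }
  return true;
}</pre></div></div>

<p>Bundling the state in a struct keeps the callback free of globals, which matters once more than one thread runs a scan at the same time.</p>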
<p>I hope this function can help you too.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/05/tokyo-cabinet-protected-database-iteration/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

