Archive

Archive for December, 2008

Ever wished for more Cache per Node?

December 31st, 2008

Ever wished that you could pack more data into a certain memcached instance without scaling up? AKA, reduce the need to scale-out, therefore save money on your infrastructure cost and electricity?

How about saving 8 bytes for every item that you cache? meaning if you cache 10 million small items, you save around 76 megabytes of RAM. Guess what? you can with the upcoming 1.3 series! and the code that achieves this by Trond Norbye is now merged into the 1.3 development tree.

I must mention though that this saving is only available to those that do not use the CAS feature. Keep reading to understand why…

The Magic

In the current release of memcached, an 8 byte value (uint64_t) is hardwired to the item structure which is now optional by specifying it at startup (the new ‘-C’ option). In the code level, the easiest way to describe it is that the CAS value is moved/aligned to the tail of the structure on memory. Needless to say, we know whether that 8 byte value exists or not from the startup option.

Observation

So, here is a very modest test that sets one-hundred-thousand items (unless there is a key clash…).

Firstly the Perl script:

#!/usr/bin/perl
 
use strict;
use warnings;
 
use Cache::Memcached;
use constant {
    NITEMS => 100000,
    KLEN   => 8,
    VLEN   => 16,
};
 
sub random_key {
    my @chars = ('a'..'z', 'A'..'Z', '0'..'9');
    my $buf = "";
 
    foreach (1..KLEN) {
        $buf .= $chars[rand @chars];
    }
    return $buf;
}
 
my $memc = Cache::Memcached->new ({
    servers => ['localhost:11211'],
    debug   => 0,
});
 
my $val = 'a' x VLEN;
 
for (1..NITEMS) {
    $memc->set(random_key, $val);
}

Running this script against an instance with standard growth factor:

 memcached -m 1024 -d

produced the following statistics on number of items and bytes consumed:

...
STAT bytes 9000000
STAT curr_items 100000
STAT total_items 100000
STAT evictions 0
END

Similarly, running the above script against an instance with the new option (‘-C’):

memcached -m 1024 -C -d

had produced the following statistics:

...
STAT bytes 8200000
STAT curr_items 100000
STAT total_items 100000
STAT evictions 0
END

Thats a saving of 781 kilobytes batman!

This is great news for those that cache lots of small objects. For example, at mixi.jp we are probably the largest user of memcached in Japan and turns out the granularity of our cache is darn fine. This patch is going to hit us big.

++ to Domas for groaning… I mean indicating this improvement opportunity and Trond for hacking it :)

Toru Maesaka memcached, oss

Introducing Digest::MurmurHash for Perl

December 29th, 2008

I noticed that an interface to the MurmurHash algorithm wasn’t available on CPAN so I quickly whipped up an XS module for it and it is now shipped to CPAN. The module itself is very simple and only exports one function that will return a corresponding uint32_t value for the original data that you feed. All you need is a C99 compliant compiler and a build environment to get this module up and running (and Perl of course).

http://search.cpan.org/~tmaesaka/Digest-MurmurHash-0.10/

If you’ve never heard of MurmurHash, it is a hash function that provides excellent speed, collision resistance and distribution characteristics. If you’re interested in the details, Appleby’s experiment result is a good place to look at.

To illustrate the speed of this module, I compared it with various other hash modules on CPAN in the following way, and resulted in this outcome.

#!/usr/bin/perl
 
use strict;
use warnings;
 
use Benchmark qw(timethese);
use Digest::FNV        qw(fnv);
use Digest::JHash      qw(jhash);
use Digest::MD5        qw(md5);
use Digest::MurmurHash qw(murmur_hash);
use Digest::Pearson    qw(pearson);
use Digest::SHA1       qw(sha1);
use String::CRC32;
 
my $data = "some_random_string_to_hash";
 
timethese(100000000, {
    crc32   => "crc32($data)",
    fowler  => "fnv($data)",
    jenkins => "jhash($data)",
    md5     => "md5($data)",
    murmur  => "murmur_hash($data)",
    pearson => "pearson($data)",
    sha1    => "sha1($data)",
});

To be honest, being the fastest in this particular benchmark isn’t so important but I was glad to see that it is fast enough to be a possible candidate for those that are looking for a hash function in Perl. FYI, looks like the folks at Ruby are considering the algorithm (watch out, it’s in Japanese!) for their internal hash function.

Oh, and this happens to be my first ever CPAN module too! Thanks to lyokato and bonnu for guiding me through the process of packaging and shipping the module and tokuhirom for helping me with XS issues.

I know, its a tiny module but it was enough to get me excited :)

Toru Maesaka oss

Drizzle’s String Library Diet

December 16th, 2008

Lately I’ve been spending most of my time with Drizzle working towards the Cirrus milestone. Specifically speaking, I’ve been slowly standardizing the codebase by throwing out lots of code in MySQL’s string library and replacing them with appropriate libc and C++ alternatives.

You see, back in the 80s MySQL had reinvented a lot of the string functionalities provided by libc for reasons that I do not know (because it was before my time). Turns out that most of the code is still in use today and I guess there was a good reason back in the day but nowadays this doesn’t seem to make much sense, since:

  • Despite the criticisms, glibc works darn well.
  • The priority of optimizing library functions is much higher for standard library developers than it is for you as an application developer.
  • Using the standard library also helps new Drizzle community developers understand the codebase much faster from seeing functions that they are already familiar with.

Arguably, being returned a pointer to the terminating NULL like most of MySQL functions makes string appending slightly easier but if you ask me, many people (including myself) are not comfortable with this and it makes the codebase look weird, IMHO. An example of this is having to rewind the pointer when passing the string to a third-party function.

Benefits gained from narrowing to UTF-8

Because UTF-8 is the prominent encoding in the areas that we are targeting (web and the cloud), currently Drizzle uses only UTF-8 for its internal representation. So needless to say, support for anything other than UTF-8 were thrown out from the library which helped reduce the size of the library greatly.

Interested in how much slimmer the Drizzle string library is compared to the original one in MySQL 5.1? To illustrate the difference, here are the results from counting the files and lines:

$ wc -l mysql-5.1.30/strings/*.c
...
96798 total
 
$ ll mysql-5.1.30/strings/ | wc -l
78
$ wc -l drizzle/mystrings/*.cc
...
24634 total
 
$ ll drizzle/mystrings/ | wc -l
31

AWESOME.

Toru Maesaka drizzle, oss ,