Toru Maesaka

Web addict and a hackaholic based in Tokyo

Archive for the ‘drizzle’ Category

Drizzle’s String Library Diet

with one comment

Lately I’ve been spending most of my time with Drizzle working towards the Cirrus milestone. Specifically, I’ve been slowly standardizing the codebase by throwing out lots of code from MySQL’s string library and replacing it with appropriate libc and C++ alternatives.

You see, back in the 80s MySQL reinvented a lot of the string functionality provided by libc, for reasons that I do not know (it was before my time). It turns out that most of that code is still in use today. I guess there was a good reason back in the day, but nowadays it doesn’t seem to make much sense, since:

  • Despite the criticisms, glibc works darn well.
  • The priority of optimizing library functions is much higher for standard library developers than it is for you as an application developer.
  • Using the standard library also helps new Drizzle community developers understand the codebase much faster, because they see functions that they are already familiar with.

Arguably, getting back a pointer to the terminating NUL byte, as most of MySQL’s string functions do, makes string appending slightly easier. But if you ask me, many people (including myself) are not comfortable with this, and IMHO it makes the codebase look weird. One example is having to rewind the pointer to the start of the string before passing it to a third-party function.
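To make the rewinding problem concrete, here is a minimal sketch. `copy_return_end` and the two `join_*` helpers are hypothetical names written for this example and are not taken from the MySQL or Drizzle source. The first helper mimics the convention described above by returning a pointer to the terminating NUL, and the `std::string` version shows roughly what a standard-library replacement looks like.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a MySQL-style copy routine: copies src into dst
// and returns a pointer to the terminating NUL of dst, not to its start.
char *copy_return_end(char *dst, const char *src) {
  assert(dst != nullptr && src != nullptr);
  while ((*dst = *src++) != '\0') {
    ++dst;
  }
  return dst;
}

// Appending is cheap because each call resumes where the previous one stopped.
// buf must have room for both strings plus the terminator.
char *join_c_style(char *buf, const char *a, const char *b) {
  char *end = copy_return_end(buf, a);
  end = copy_return_end(end, b);
  // end points at the NUL, so a caller that wants the text must go back to buf.
  return end;
}

// Standard-library version: no end pointer to track or rewind.
std::string join_cpp_style(const std::string &a, const std::string &b) {
  std::string out(a);
  out.append(b);
  return out;
}
```

With the C-style helper the caller has to keep `buf` around separately, because the returned pointer is useless to any function that expects the start of the string.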

Benefits gained from narrowing to UTF-8

Because UTF-8 is the predominant encoding in the areas that we are targeting (web and the cloud), Drizzle currently uses only UTF-8 for its internal representation. So, needless to say, support for anything other than UTF-8 was thrown out of the library, which helped reduce its size greatly.

Interested in how much slimmer the Drizzle string library is compared to the original one in MySQL 5.1? To illustrate the difference, here are the results from counting the files and lines:

$ wc -l mysql-5.1.30/strings/*.c
...
96798 total
 
$ ll mysql-5.1.30/strings/ | wc -l
78
$ wc -l drizzle/mystrings/*.cc
...
24634 total
 
$ ll drizzle/mystrings/ | wc -l
31

AWESOME.

Written by tmaesaka

December 16th, 2008 at 5:03 pm

Posted in drizzle, oss


Rethinking the Query Cache for Drizzle

without comments

There is a mutual understanding in the Drizzle community that the MySQL query cache works well for a small database but isn’t sufficient for relatively large-scale usage. Does your application involve a lot of database updates? If so, you’ll probably face fragmentation issues in the query cache (though the query cache isn’t really suited to use cases like this anyway).

Caching is the key ingredient in boosting the performance of any software that requires a significant amount of computation, hence it is something that can’t be overlooked. So how can we improve Drizzle?

The idea is to create a pluggable query cache subsystem that can work in a large-scale environment. Since Drizzle is a microkernel DBMS, it makes sense to make the cache component pluggable and let the DBA pick the caching solution of their choice. This is exactly what I’m working on at the moment, and my first plugin will allow Drizzle to use memcached as its query cache.

For example, a DBA could hook up their memcached pool to Drizzle and use several gigabytes of fast cache space to cache their results.
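As a rough illustration of what "pluggable" could mean here, the sketch below hides the cache behind a small interface so the server core never knows which backend it is talking to. Every name in it (`QueryCache`, `LocalMapCache`, `Server`) is an assumption made up for this example and not the actual Drizzle plugin API. The in-process map stands in for a real backend such as a memcached client.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface; none of these names come from the Drizzle source.
class QueryCache {
 public:
  virtual ~QueryCache() = default;
  // Returns true and fills result on a cache hit.
  virtual bool get(const std::string &query, std::string &result) = 0;
  virtual void set(const std::string &query, const std::string &result) = 0;
};

// In-process backend standing in for a real one (e.g. a memcached client).
class LocalMapCache : public QueryCache {
 public:
  bool get(const std::string &query, std::string &result) override {
    auto it = entries_.find(query);
    if (it == entries_.end()) return false;
    result = it->second;
    return true;
  }
  void set(const std::string &query, const std::string &result) override {
    entries_[query] = result;
  }

 private:
  std::map<std::string, std::string> entries_;
};

// The server core only sees the interface; a null cache means "no caching".
class Server {
 public:
  explicit Server(std::unique_ptr<QueryCache> cache)
      : cache_(std::move(cache)) {}

  std::string execute(const std::string &query) {
    assert(!query.empty());
    std::string result;
    if (cache_ && cache_->get(query, result)) {
      ++hits_;
      return result;
    }
    ++executions_;
    result = "rows for: " + query;  // placeholder for real query execution
    if (cache_) cache_->set(query, result);
    return result;
  }

  int hits() const { return hits_; }
  int executions() const { return executions_; }

 private:
  std::unique_ptr<QueryCache> cache_;
  int hits_ = 0;
  int executions_ = 0;
};
```

Swapping the backend then only means handing a different `QueryCache` implementation to the server, and passing no cache at all covers the DBA who does not want caching.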

Things to consider

  • Does the DBA really want to cache results?
  • Does the result construction take long enough to care?
  • Do we want to specify a specific SQL statement to always cache?
  • Do we want to enforce a certain table to be cached?
  • Transactional Engines
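One way to picture the questions above is as a single policy check that runs before a result is stored. This is only a sketch under my own assumptions: the struct fields, the 10 ms default threshold, and the rule of never caching inside a transaction are illustrative choices and not decisions that have been made for Drizzle.

```cpp
#include <cassert>
#include <chrono>
#include <set>
#include <string>

// Hypothetical knobs a DBA might set; names and defaults are illustrative.
struct CachePolicy {
  bool enabled = true;                              // does the DBA want caching?
  std::chrono::milliseconds min_build_time{10};     // is the result costly enough?
  std::set<std::string> always_cache_tables;        // tables forced into the cache
};

// Hypothetical facts about one executed statement.
struct QueryInfo {
  std::string table;
  std::chrono::milliseconds build_time{0};
  bool force_cache = false;     // e.g. requested for this specific statement
  bool in_transaction = false;  // running inside an open transaction
};

bool should_cache(const CachePolicy &policy, const QueryInfo &q) {
  assert(q.build_time.count() >= 0);
  if (!policy.enabled) return false;
  // Conservative assumption: skip results produced inside a transaction.
  if (q.in_transaction) return false;
  if (q.force_cache) return true;
  if (policy.always_cache_tables.count(q.table) > 0) return true;
  return q.build_time >= policy.min_build_time;
}
```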

If we can satisfy the above points and achieve modularity, I think it’s a total win. For those who like diagrams, here is the architecture that is on my mind at the moment:

[Figure: Drizzle Query Cache Plugin Example]

Benefits of using memcached

Various players in the web industry have proven that memcached works and helps scale web applications in a cost-effective fashion. It is also fast. Fetching a cached result from memcached is O(1) in time complexity, which is an order we all love. Furthermore, by using memcached, the fragmentation issue disappears, since this is a problem that the memcached community faced in the past and successfully overcame by developing the slab subsystem.

Want to scale? With consistent hashing enabled, you can greatly reduce the number of cache misses caused by adding or removing a node from a live pool. Got spare boxes lying around? Hook them up and power up Drizzle! Need support? Both memcached and Drizzle community members are heartwarming people.
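For those wondering why consistent hashing keeps misses low, here is a small sketch of a hash ring. It is not how any particular memcached client implements it: the `HashRing` class, the FNV-1a hash, and the 100 virtual points per node are assumptions picked to keep the example short and deterministic.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// FNV-1a, chosen only because it is short and deterministic.
std::uint32_t fnv1a(const std::string &s) {
  std::uint32_t h = 2166136261u;
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

class HashRing {
 public:
  explicit HashRing(int replicas = 100) : replicas_(replicas) {
    assert(replicas > 0);
  }

  // Place several virtual points per node on the ring to even out the load.
  void add_node(const std::string &node) {
    for (int i = 0; i < replicas_; ++i) {
      // emplace leaves an existing point alone if two hashes ever collide.
      ring_.emplace(fnv1a(node + "#" + std::to_string(i)), node);
    }
  }

  void remove_node(const std::string &node) {
    for (int i = 0; i < replicas_; ++i) {
      auto it = ring_.find(fnv1a(node + "#" + std::to_string(i)));
      if (it != ring_.end() && it->second == node) ring_.erase(it);
    }
  }

  // Walk clockwise from the key's position to the first node point.
  const std::string &node_for(const std::string &key) const {
    assert(!ring_.empty());
    auto it = ring_.lower_bound(fnv1a(key));
    if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
    return it->second;
  }

 private:
  int replicas_;
  std::map<std::uint32_t, std::string> ring_;
};
```

When a node is added, a key either stays on the node it already had or moves to the new node, so the rest of the pool keeps its cached results. With plain modulo hashing, most keys would change nodes instead.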

Other Solutions Work Too!

The beauty of modularity is that you can create and use your own solution for your unique requirements. For example, let’s assume that there is a webshop that wants to keep the number of physical servers down (e.g. because of limited money or space).

To satisfy the requirement stated above, you could cache to a fantastically fast hash database such as Tokyo Cabinet (much, much faster than BDB). If you haven’t heard of it, you should look at the incredible benchmark comparison. So, what I really wanted to say is that the microkernel property of Drizzle will open up a lot of new possibilities for your application and help you tackle the new requirements that seem to come out of nowhere.

Where from here?

I’m currently going through the UDF -> Plugin Architecture conversion done by Mark, and I’m planning on basing the code on his logging plugin while it’s fantastically simple. My work will be done in:

  • lp:~tmaesaka/drizzle/pluggable-qcache

I’ll hopefully have something decent to show soon, and I will keep people updated on my blog, IRC and the Mailing List (drizzle-discuss).

So that is all I have to say for now… If you have any suggestions, please do enlighten me :)

Written by tmaesaka

October 10th, 2008 at 4:54 pm

Posted in drizzle, memcached, oss


Thoughts on UTF-8 over CJK charsets in Drizzle

with 3 comments

Internally, Drizzle will use UTF-8 everywhere and _only_ UTF-8. This is simply because UTF-8 is the choice of encoding within the Drizzle community at the moment. To me, this decision makes sense since UTF-8 is popular in the areas that Drizzle is targeting (Web and the Cloud). Limiting to UTF-8 also means that the Drizzle codebase would become cleaner, thus easier to maintain. However, there are arguments against it in the community so this could change in the future.

So, what does this mean to those that are outside regions that use latin characters, specifically East Asia? Would this cause an uproar?

A few months ago, Brian Aker asked me about this, and after a brief discussion with Jay Pipes a couple of days ago, I figured I should blog about it so I can keep it as a note for myself and hopefully gain feedback from those who stumble across this entry. Here are my thoughts, based on my knowledge of the Japanese web industry:

Web Industry Standard in Japan

Looking at the web industry trend in Japan, UTF-8 is becoming the prominent encoding, despite the fact that UTF-8 requires more computation power and space than Japanese CJK charsets. For example, mixi.jp (one of the largest websites in Japan) still uses EUC-JP (CJK family) due to historical reasons but if you look at their newer features like video sharing, you can see that they’ve begun adopting UTF-8. Yahoo! JP, COOKPAD, ja.wikipedia and Livedoor are great examples of large Japanese sites too.

The reasons UTF-8 is becoming popular in the .jp domain are, IMHO:

  • The default encoding of XHTML is UTF-8/UTF-16
  • All browsers support UTF-8 nowadays (if yours doesn’t, you shouldn’t be using it)
  • Theoretically, more characters can be represented in UTF-8
  • Theoretically, existing ASCII functions can be used

However, there are certainly cases where web developers might need to use their local encoding to support things like mobile devices (Shift-JIS in Japan). These unique requirements should, IMHO, be handled by the client: rather than making the DBMS responsible, you should encode the returned result to whatever you like in the application layer before rendering it.

More overhead per character

Using UTF-8 means that there is going to be an estimated average overhead of 1 byte per character (an EUC-JP character is typically 2 bytes, while the same character in UTF-8 is typically 3). Hence, if you already have a lot of textual data in one of the CJK encodings, you’re definitely going to use more storage, and the more data you have, the more significant the difference.
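The 1 byte figure can be checked with a quick back-of-the-envelope calculation. The sketch below counts characters in a UTF-8 string and estimates the EUC-JP size under a simplifying assumption of mine: ASCII takes 1 byte and every other character takes 2 bytes in EUC-JP. That holds for ordinary kana and kanji but not for every character EUC-JP can encode. The function and struct names are made up for this example, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct SizeEstimate {
  std::size_t characters;   // number of code points in the input
  std::size_t utf8_bytes;   // actual size of the UTF-8 input
  std::size_t eucjp_bytes;  // estimate: 1 byte for ASCII, 2 for anything else
};

// Assumes well-formed UTF-8 input.
SizeEstimate estimate_sizes(const std::string &utf8) {
  SizeEstimate e{0, utf8.size(), 0};
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) == 0x80) continue;  // continuation byte, not a new character
    ++e.characters;
    e.eucjp_bytes += (byte < 0x80) ? 1 : 2;
  }
  return e;
}

// Average number of extra bytes UTF-8 spends per character.
double overhead_per_character(const SizeEstimate &e) {
  assert(e.characters > 0);
  return (static_cast<double>(e.utf8_bytes) -
          static_cast<double>(e.eucjp_bytes)) /
         static_cast<double>(e.characters);
}
```

For the three-character string 日本語 this gives 9 bytes in UTF-8 against an estimated 6 in EUC-JP, which is 1 byte of overhead per character, while pure ASCII text has no overhead at all.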

Eating more space may seem significant, but to me, what’s more significant is the falling cost of memory and storage media nowadays. If you begin facing problems due to having too much data, it’s probably time to consider horizontal partitioning anyway.

Conclusion

The topic discussed in this entry is very sensitive, and it is merely my personal opinion. Every encoding has its ups and downs like all things (they were designed for a purpose after all) and hence there are many people with different opinions. Satisfying everyone is difficult, but who knows? UTF-8 alone may satisfy the majority of the users that we are targeting. If it doesn’t, then I guess we’ll have to think again… We also need to look into internal sort performance if we go pure UTF-8.

The conclusion Jay and I came up with in our brief discussion was that providing a conversion tool in the Drizzle package could be a good start to get people jumping into the UTF-8 boat. There is no specific plan, nor have we decided to do this yet, but if we were to do it, I’m thinking that the tool can be something simple that uses GNU libiconv.
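To give an idea of how small the core of such a tool could be, here is a sketch built on the standard `iconv_open`/`iconv`/`iconv_close` interface, which GNU libiconv also provides. `convert_encoding` is a hypothetical helper written for this example and no such tool exists yet. Which encoding names are accepted (for instance "EUC-JP") depends on the iconv implementation installed on the system, and some platforms need to link against `-liconv`.

```cpp
#include <cassert>
#include <cerrno>
#include <iconv.h>
#include <string>

// Converts input from encoding `from` to encoding `to` using iconv(3).
// Returns false if the encoding pair is unsupported or the input is invalid.
bool convert_encoding(const std::string &input, const char *from,
                      const char *to, std::string &output) {
  assert(from != nullptr && to != nullptr);
  iconv_t cd = iconv_open(to, from);  // note the order: target comes first
  if (cd == reinterpret_cast<iconv_t>(-1)) return false;

  output.clear();
  // iconv takes a non-const pointer but does not modify the input bytes.
  char *in = const_cast<char *>(input.data());
  size_t in_left = input.size();
  char buf[256];
  bool ok = true;

  while (in_left > 0) {
    char *out = buf;
    size_t out_left = sizeof(buf);
    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    output.append(buf, sizeof(buf) - out_left);
    // E2BIG only means buf filled up, so loop again; anything else is fatal.
    if (rc == static_cast<size_t>(-1) && errno != E2BIG) {
      ok = false;  // EILSEQ or EINVAL: invalid or truncated input
      break;
    }
  }
  // Stateful target encodings would also need a final flush call here;
  // UTF-8 is not stateful, so this sketch skips it.
  iconv_close(cd);
  return ok;
}
```

A real tool would wrap this in a loop over table dumps or files, for example calling `convert_encoding(data, "EUC-JP", "UTF-8", converted)` on each chunk and reporting the inputs that fail.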

Hey, there is always the brute-force solution of storing textual data in the encoding of your choice as binary ;)

Written by tmaesaka

September 28th, 2008 at 5:24 pm

Drizzle Article in Japanese

without comments

Yesterday, an article I wrote for a fairly large Japanese IT news portal called @IT was made public and I figured I should blog about it in English, so that I can tell my fellow Drizzlers about it. Here is the link to the article even though it is in Nihongo ;)

http://www.atmarkit.co.jp/fdb/rensai/drzl_pj/drzl01.html

This three-page multi-byte article starts by covering how the project was launched by Brian Aker, and the overall concept and philosophy of Drizzle. I then moved on to describing how we are modernizing things, for example adopting the C99 standard, targeting modern hardware (lots and lots of cores) and the microkernel architecture. I also described how we intend to work with other open source communities by actively using open source libraries that are out there, rather than writing our own or using MySQL’s existing libraries.

One of the misunderstandings that came up after the announcement of Drizzle at OSCON was that Drizzle was being compared against SQLite. I was afraid that the same could happen in Japan so I made sure that this misunderstanding wouldn’t happen in my article. If you’re interested in the difference, it is well described in the Drizzle Wiki:

http://drizzle.wikia.com/wiki/Drizzle_compared_with_SQLite

Other than that, I thoroughly explained how we are committed to being open and transparent, hence constantly welcoming people and any suggestions and patches that they might have. Even if you find your suggestion to be something trivial, it could turn out to be a breakthrough for the community.

So the point is, let’s all stimulate each other, have fun, and make a great piece of software :)

Written by tmaesaka

September 4th, 2008 at 11:20 pm

Posted in drizzle, oss

Drizzle, out in the open

without comments

So I’ve been fortunate enough to participate in developing Drizzle, which is a microkernel fork of MySQL that you can read more about in Brian Aker’s blog post.

In brief, we are getting rid of components that we find unnecessary in MySQL by default, and instead making them optional by refactoring the server to be modular, aka microkernel. In other words, we are trying to develop a lean, fast, simple and extensible RDBMS that would fit well in mid and large scale web applications.

How? Well, take the Query Cache for example. QC works well in a one-man database but it has very little (if any) effect when we start thinking big, especially in the web industry. So why bother keeping it? What would be better is if we could _optionally_ make Drizzle use a cluster of memcached for query caching, which would also allow many database instances to share a common cache. The same can be said about many other components, such as ACL and Stored Procedures. This is exactly why we are moving to a microkernel architecture. If you want something special, you should be able to customize the server in a relatively easy fashion and satisfy your requirements, rather than having to refactor the server code yourself.

Indeed, not everyone needs a microkernel database. In fact, I assume most people won’t. However, there are enough web developers and companies in the small portion of the pie that would love a microkernel database to solve the problems that they are facing today. This is exactly why we don’t consider Drizzle to be a MySQL replacement.

If you’d like more information, do check out our project page on Launchpad and browse through the mailing list archive. Drizzle development is done in a true open source fashion by using open resources and tools like Bazaar and Launchpad. This means that everyone is free to come up with improvement suggestions/patches and submit them to the Drizzle community.

Drizzle has been very fun and I thank Brian for getting me involved in such a fun project :)

Btw, I wrote a blog post on Drizzle in Japanese on the Mixi engineering blog too.

Written by tmaesaka

July 23rd, 2008 at 4:03 pm

Posted in drizzle, oss
