Toru Maesaka

Web addict and a hackaholic based in Tokyo

Archive for the ‘drizzle’ tag

Rethinking the Query Cache for Drizzle

without comments

There is a mutual understanding in the Drizzle community that the MySQL query cache works well for a small database but isn’t sufficient for relatively large scale usage. Does your application involve a lot of database updates? If so, you’ll probably face fragmentation issues in the query cache (though the query cache isn’t really suited to update-heavy use cases in the first place).

Caching is a key ingredient in boosting the performance of any software that requires a significant amount of computation, hence it is something that can’t be overlooked. So how can we improve Drizzle?

The idea is to create a pluggable query cache subsystem that can work in a large scale environment. Since Drizzle is a microkernel DBMS, it makes sense to make the cache component pluggable and let the DBA pick the caching solution of their choice. This is exactly what I’m working on at the moment, and my first plugin will allow Drizzle to use memcached as its query cache.

For example, a DBA could hook up their memcached pool to Drizzle and use several gigabytes of fast cache space to cache their results.
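To make the idea a little more concrete, here is a minimal sketch of what a pluggable cache boundary could look like. It is written in Python purely for illustration: the `QueryCache` interface, the `DictCache` backend (an in-process stand-in for a memcached pool) and the `execute` helper are all hypothetical names, not Drizzle's actual plugin API, and invalidation on writes is deliberately left out.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Callable, Optional


class QueryCache(ABC):
    """Hypothetical plugin interface; not Drizzle's real API."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        """Return the cached result, or None on a miss."""

    @abstractmethod
    def set(self, key: str, value: bytes) -> None:
        """Store a serialized result under the given key."""


class DictCache(QueryCache):
    """In-process stand-in for a memcached-backed plugin."""

    def __init__(self) -> None:
        self._store = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._store[key] = value


def cache_key(sql: str) -> str:
    # Hash the statement text so keys have a fixed length and no spaces.
    return hashlib.sha1(sql.encode("utf-8")).hexdigest()


def execute(sql: str, cache: QueryCache,
            run_query: Callable[[str], bytes]) -> bytes:
    """Serve from the cache when possible, otherwise run and store."""
    key = cache_key(sql)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = run_query(sql)
    cache.set(key, result)
    return result
```

The point of the sketch is that the server only talks to the small `get`/`set` interface, so swapping the backend does not touch the query path.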

Things to consider

  • Does the DBA really want to cache results?
  • Does the result construction take long enough to care?
  • Do we want to specify a specific SQL statement to always cache?
  • Do we want to enforce a certain table to be cached?
  • Transactional Engines

If we can satisfy the above points and achieve modularity, I think it’s a total win. For those that like diagrams, here is the architecture that is on my mind at the moment:

[Figure: Drizzle Query Cache Plugin Example]

Benefits of using memcached

memcached has been proven by various players in the web industry to work and to help scale web applications cost-effectively. It is also fast: fetching a cached result from memcached is an O(1) hash lookup, which is an order we all love. Furthermore, using memcached makes the fragmentation issue go away, since the memcached community faced this problem in the past and overcame it by developing the slab subsystem.

Want to scale? With consistent hashing enabled, you can greatly reduce the number of cache misses caused by adding or removing a node in a live pool. Got spare boxes lying around? Hook them up and power up Drizzle! Need support? Both memcached and Drizzle community members are heartwarming people.
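For readers who have not seen consistent hashing before, the following is a generic sketch of a hash ring. It is not the exact algorithm any particular memcached client uses; the `HashRing` class, the node names and the choice of 100 points per node are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Map a string to a point on the ring.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Generic consistent-hash ring (illustrative, hypothetical class)."""

    def __init__(self, nodes, replicas=100):
        self._replicas = replicas  # points placed on the ring per node
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._replicas):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node):
        for i in range(self._replicas):
            point = _hash(f"{node}#{i}")
            self._points.remove(point)
            del self._owner[point]

    def node_for(self, key):
        # First point clockwise from the key's position, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"query-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add("cache-d")
moved = [k for k in keys if ring.node_for(k) != before[k]]
```

Because the existing points never move, every key in `moved` now belongs to the newly added node, and all other keys keep hitting the node that already holds their cached result. With a plain `hash(key) % node_count` scheme, most keys would be remapped instead.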

Other Solutions Work Too!

The beauty of modularity is that you can create and use your own solution for your unique requirements. For example, let’s assume that there is a webshop that wants to keep the number of physical servers down (e.g. because of limited money or space).

To satisfy the requirement stated above, you could cache to a fantastically fast hash database such as Tokyo Cabinet (much, much faster than BDB). If you haven’t heard of it, you should look at its incredible benchmark comparison. What I really want to say is that the microkernel property of Drizzle will open up a lot of new possibilities for your application and help you tackle the new requirements that seem to come out of nowhere.

Where to from here?

I’m currently going through the UDF -> Plugin Architecture conversion done by Mark, and I plan to base the code on his logging plugin, since it’s fantastically simple. My work will be done in:

  • lp:~tmaesaka/drizzle/pluggable-qcache

I’ll hopefully have something decent to show soon, and I will keep people updated on my blog, IRC and the Mailing List (drizzle-discuss).

So that is all I have to say for now… If you have any suggestions, please do enlighten me :)

Written by tmaesaka

October 10th, 2008 at 4:54 pm

Posted in drizzle, memcached, oss


Thoughts on UTF-8 over CJK charsets in Drizzle

with 3 comments

Internally, Drizzle will use UTF-8 everywhere and _only_ UTF-8. This is simply because UTF-8 is the encoding of choice within the Drizzle community at the moment. To me, this decision makes sense since UTF-8 is popular in the areas that Drizzle is targeting (the Web and the Cloud). Limiting ourselves to UTF-8 also means that the Drizzle codebase would become cleaner and thus easier to maintain. However, there are arguments against it in the community, so this could change in the future.

So, what does this mean for those in regions that don’t use Latin characters, specifically East Asia? Would this cause an uproar?

A few months ago, Brian Aker asked me about this, and after a brief discussion with Jay Pipes a couple of days ago, I figured I should blog about it so I can keep it as a note for myself and hopefully gain feedback from those who stumble across this entry. Here are my thoughts, based on my knowledge of the Japanese web industry:

Web Industry Standard in Japan

Looking at the web industry trend in Japan, UTF-8 is becoming the prominent encoding, despite the fact that UTF-8 requires more computation power and space than Japanese CJK charsets. For example, mixi.jp (one of the largest websites in Japan) still uses EUC-JP (CJK family) due to historical reasons but if you look at their newer features like video sharing, you can see that they’ve begun adopting UTF-8. Yahoo! JP, COOKPAD, ja.wikipedia and Livedoor are great examples of large Japanese sites too.

The reasons UTF-8 is becoming popular in the .jp domain, IMHO, are:

  • The default encoding of XHTML is UTF-8/UTF-16
  • All browsers support UTF-8 nowadays (if yours doesn’t, you shouldn’t be using it)
  • Theoretically, more characters can be represented in UTF-8
  • Theoretically, existing ASCII functions can be used

However, there are certainly cases where web developers might need to use their local encoding to support things like mobile devices (Shift-JIS in Japan). These unique requirements IMHO should be handled by the client: rather than making the DBMS responsible, you should encode the returned result to whatever you like in the application layer before rendering it.
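As a small sketch of what handling this in the application layer could look like, the snippet below takes UTF-8 bytes as they might come back from the database and re-encodes them for a Shift-JIS client. The function name `render_for_client` and the sample string are made up for illustration, and Python’s built-in codecs stand in for whatever conversion library your stack uses.

```python
def render_for_client(utf8_bytes: bytes, client_charset: str) -> bytes:
    """Re-encode a UTF-8 result for a client that expects another charset."""
    text = utf8_bytes.decode("utf-8")
    # errors="replace" substitutes "?" for characters that the target
    # charset cannot represent, instead of raising an exception.
    return text.encode(client_charset, errors="replace")


# Bytes as they might be returned by a UTF-8-only DBMS (sample data).
row = "日本語のテキスト".encode("utf-8")
sjis_page = render_for_client(row, "shift_jis")
```

The database only ever sees UTF-8; the choice of output charset is made per request, at the very edge of the application.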

More overhead per character

Using UTF-8 means that there is going to be an estimated average overhead of 1 byte per character (an EUC-JP character is typically 2 bytes, while the same character typically takes 3 bytes in UTF-8). Hence, if you already have a lot of textual data in one of the CJK encodings, you’re definitely going to use more storage, and the more data you have, the more significant this becomes.
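The overhead is easy to check for yourself. The snippet below is just an illustration using Python’s built-in codecs and a made-up sample string; the helper name `encoded_size` is hypothetical.

```python
def encoded_size(text: str, charset: str) -> int:
    """Number of bytes the text occupies in the given charset."""
    return len(text.encode(charset))


sample = "日本語のテキスト"  # 8 characters of kanji, hiragana and katakana

utf8_size = encoded_size(sample, "utf-8")    # 3 bytes per character here
eucjp_size = encoded_size(sample, "euc_jp")  # 2 bytes per character here
overhead_per_char = (utf8_size - eucjp_size) / len(sample)
```

For this sample the overhead works out to exactly 1 byte per character. ASCII characters take 1 byte in both encodings, so real-world text that mixes Japanese with ASCII will see a lower average.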

Eating more space may seem significant, but to me, what’s more significant is how much cheaper memory and storage media have become. If you begin facing problems due to having too much data, it’s probably time to consider horizontal partitioning anyway.

Conclusion

The topic discussed in this entry is very sensitive, and this is merely my personal opinion. Every encoding has its ups and downs like all things (they were designed for a purpose, after all), and hence there are many people with different opinions. Satisfying everyone is difficult, but who knows? UTF-8 alone may satisfy the majority of the users that we are targeting. If it doesn’t, then I guess we’ll have to think again… We also need to look into internal sort performance if we go pure UTF-8.

The conclusion Jay and I came up with in our brief discussion was that providing a conversion tool in the Drizzle package could be a good start to get people jumping onto the UTF-8 boat. There is no specific plan, nor have we decided to do this yet, but if we were to do it, I’m thinking that the tool could be something simple that uses GNU libiconv.
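To give a feel for how small such a tool could be, here is a sketch of the core conversion loop. Note that it uses Python’s built-in incremental codecs rather than GNU libiconv, and that `convert_stream` and its parameters are hypothetical names; nothing like this exists in the Drizzle package.

```python
import codecs
import io


def convert_stream(src, dst, from_charset, to_charset="utf-8",
                   chunk_size=64 * 1024):
    """Copy bytes from src to dst, converting between charsets.

    Incremental codecs keep state between calls, so a multi-byte
    character split across two chunks is still decoded correctly.
    """
    decoder = codecs.getincrementaldecoder(from_charset)()
    encoder = codecs.getincrementalencoder(to_charset)()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(encoder.encode(decoder.decode(chunk)))
    # final=True raises if the input ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    dst.write(encoder.encode(tail, final=True))


# Usage with in-memory streams; real use would pass open binary files.
source = io.BytesIO("日本語のテキスト abc".encode("euc_jp"))
target = io.BytesIO()
convert_stream(source, target, "euc_jp", chunk_size=3)
```

The deliberately tiny `chunk_size` in the usage example forces characters to straddle chunk boundaries, which is the case a naive chunk-by-chunk `decode` would get wrong.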

Hey, there is always the brute solution of storing textual data of your choice in binary ;)

Written by tmaesaka

September 28th, 2008 at 5:24 pm

Mac OS X, Ubuntu and Drizzle

without comments

So admittedly, Mac OS X is currently not the most friendly platform to work with Drizzle, mostly due to library issues.

OS X has several weird hacks in it due to licensing issues (libreadline comes to mind first). Sure, MacPorts, Darwin Ports and so on could get around this problem, but should this be necessary? Personally, I dislike resorting to these solutions. Fortunately, I’ve been doing all my Drizzle work with Ubuntu on a dedicated server, so I’ve yet to come across any build-related issues. However, it kind of sucks not to be able to take my Mac out to a cafe on the weekend and work there without connectivity.

So to make my life happier, I installed Ubuntu on my MacBook Pro (alongside OS X of course).

I came across a few problems, like a corrupted partition table, in the process of getting Ubuntu working, but the following Ubuntu threads helped greatly:

General Instructions

Boot related problems when using Hardy Heron (Ubuntu 8.04)

You know, getting Ubuntu running on my Mac was entertaining, since just yesterday I was talking to Monty Taylor about his thoughts on how using a Mac is selling out. So what does this make me now?

Happy Hacking :)

Written by tmaesaka

July 30th, 2008 at 12:00 pm

Posted in oss


Drizzle, out in the open

without comments

So I’ve been fortunate enough to participate in developing Drizzle, which is a microkernel fork of MySQL that you can read more about on Brian Aker’s blog post.

In brief, we are removing components that we find unnecessary in MySQL from the default build and making them optional by refactoring the server to be modular, aka a microkernel. In other words, we are trying to develop a lean, fast, simple and extensible RDBMS that would fit well in mid and large scale web applications.

How? Well, take the Query Cache for example. QC works well in a one-man database, but it has very little (if any) effect when we start thinking big, especially in the web industry. So why bother keeping it? What would be better is if we could _optionally_ make Drizzle use a cluster of memcached servers for query caching, which would also allow many database instances to share a common cache. The same can be said about many other components, such as ACL and Stored Procedures. This is exactly why we are moving to a microkernel architecture. If you want something special, you should be able to customize the server relatively easily and satisfy your requirements, rather than having to refactor the server code yourself.

Indeed, not everyone needs a microkernel database; in fact, I assume most people won’t. However, there are enough web developers and companies in that small portion of the pie who would love a microkernel database to solve the problems that they are facing today. This is exactly why we don’t consider Drizzle to be a MySQL replacement.

If you’d like more information, do check out our project page on Launchpad and browse through the mailing list archive. Drizzle development is done in a true open source fashion, using open resources and tools like Bazaar and Launchpad. This means that everyone is free to come up with improvement suggestions or patches and submit them to the Drizzle community.

Drizzle has been very fun and I thank Brian for getting me involved in such a fun project :)

Btw, I wrote a blog post on Drizzle in Japanese on the Mixi engineering blog too.

Written by tmaesaka

July 23rd, 2008 at 4:03 pm

Posted in drizzle, oss
