Thoughts on UTF-8 over CJK charsets in Drizzle

September 28th, 2008

Internally, Drizzle will use UTF-8 everywhere and _only_ UTF-8. This is simply because UTF-8 is the encoding of choice within the Drizzle community at the moment. To me, this decision makes sense since UTF-8 is popular in the areas that Drizzle is targeting (the Web and the Cloud). Limiting ourselves to UTF-8 also means that the Drizzle codebase becomes cleaner and thus easier to maintain. However, there are arguments against it in the community, so this could change in the future.

So, what does this mean for those in regions that do not use Latin characters, specifically East Asia? Would this cause an uproar?

A few months ago, Brian Aker asked me about this, and after a brief discussion with Jay Pipes a couple of days ago, I figured I should blog about it so I can keep it as a note for myself and hopefully gain feedback from those who stumble across this entry. Here are my thoughts, based on my knowledge of the Japanese web industry:

Web Industry Standard in Japan

Looking at the web industry trend in Japan, UTF-8 is becoming the predominant encoding, despite the fact that UTF-8 requires more computation power and space than the Japanese CJK charsets. For example, mixi.jp (one of the largest websites in Japan) still uses EUC-JP (CJK family) for historical reasons, but if you look at their newer features like video sharing, you can see that they’ve begun adopting UTF-8. Yahoo! JP, COOKPAD, ja.wikipedia and Livedoor are great examples of large Japanese sites too.

The reasons UTF-8 is becoming popular in the .jp domain are, IMHO:

  • The default encoding of XHTML is UTF-8/UTF-16
  • All browsers support UTF-8 nowadays (if yours doesn’t, you shouldn’t be using it)
  • Theoretically, more characters can be represented in UTF-8
  • Theoretically, existing ASCII functions can be used
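The last bullet can be made concrete with a small Python sketch. In UTF-8, every byte of a multi-byte character has its high bit set, so a byte below 0x80 is always a real ASCII character and byte-oriented functions keep working. Shift-JIS gives no such guarantee: the second byte of some characters (ソ and 表, for instance) is 0x5C, the ASCII backslash. The sample string and variable names are made up for illustration.

```python
# Illustrative sample text (an assumption, not taken from any real dataset).
text = "ソフト,表示"

utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")

# In UTF-8, bytes below 0x80 only ever represent ASCII characters, so
# splitting the raw bytes on an ASCII delimiter is safe.
fields = [field.decode("utf-8") for field in utf8_bytes.split(b",")]
print(fields)

# In Shift-JIS, the trail byte of ソ and 表 is 0x5C (ASCII backslash), so a
# byte-level search reports a backslash that is not really in the text.
print(b"\\" in utf8_bytes)
print(b"\\" in sjis_bytes)
```

This is the kind of false match that forces charset-aware string handling when a multi-byte legacy encoding is used internally.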

However, there are certainly cases where web developers need to use their local encoding, for example to support mobile devices (Shift-JIS in Japan). These unique requirements should, IMHO, be handled by the client: rather than making the DBMS responsible, you encode the returned result into whatever you like in the application layer before rendering it.
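As a rough sketch of what handling this in the application layer could look like, here is a small Python example. The function name `render_for_device` and its parameters are hypothetical, and it assumes the database hands back UTF-8 bytes; it is not part of Drizzle or any client library.

```python
def render_for_device(utf8_bytes: bytes, device_encoding: str = "shift_jis") -> bytes:
    """Re-encode a UTF-8 result for a client that expects a local charset.

    Hypothetical helper: assumes the DBMS returned valid UTF-8.
    """
    text = utf8_bytes.decode("utf-8")
    # Characters the target charset cannot represent are replaced with "?"
    # instead of raising, which is one possible policy among several.
    return text.encode(device_encoding, errors="replace")


greeting = "こんにちは".encode("utf-8")
print(render_for_device(greeting) == "こんにちは".encode("shift_jis"))
```

Whether to replace, drop, or reject unrepresentable characters is a design choice the application has to make; the DBMS does not need to know about it.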

More overhead per character

Using UTF-8 means an estimated average overhead of 1 byte per character (a typical EUC-JP character is 2 bytes, while the same character usually takes 3 bytes in UTF-8), so if you already have a lot of textual data in one of the CJK encodings, you’re definitely going to use more storage (the more data you have, the more significant the difference).
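The overhead is easy to check for yourself. The following Python sketch counts the encoded size of a string in EUC-JP, Shift-JIS and UTF-8; the helper name `bytes_per_encoding` and the sample strings are my own, chosen only for illustration.

```python
def bytes_per_encoding(text: str) -> dict:
    """Return the encoded length of `text` in a few encodings (hypothetical helper)."""
    return {enc: len(text.encode(enc)) for enc in ("euc_jp", "shift_jis", "utf-8")}


# Three kanji: 2 bytes each in the legacy encodings, 3 bytes each in UTF-8.
print(bytes_per_encoding("日本語"))

# Pure ASCII text costs the same everywhere.
print(bytes_per_encoding("Drizzle"))
```

For text that is mostly kanji and kana, this works out to roughly 50% more bytes in UTF-8, while mixed content with a lot of ASCII sees far less growth.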

Eating more space may seem significant, but to me what’s more significant is how much the cost of memory and storage media has fallen. If you begin facing problems due to having too much data, it’s probably time to consider horizontal partitioning anyway.

Conclusion

The topic discussed in this entry is very sensitive, and this is merely my personal opinion. Every encoding has its ups and downs, like all things (they were designed for a purpose, after all), and hence there are many people with different opinions. Satisfying everyone is difficult, but who knows? UTF-8 alone may satisfy the majority of the users that we are targeting. If it doesn’t, then I guess we’ll have to think again… We also need to look into internal sort performance if we go pure UTF-8.

The conclusion Jay and I came to in our brief discussion was that providing a conversion tool in the Drizzle package could be a good start to get people jumping into the UTF-8 boat. There is no specific plan, nor have we decided to do this yet, but if we were to do it, I’m thinking the tool could be something simple that uses GNU libiconv.
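To give an idea of how small the core of such a tool could be, here is a minimal sketch. Note that it uses Python’s built-in codecs rather than GNU libiconv, and the function name `convert_to_utf8` is hypothetical; no such tool exists in Drizzle as of this writing.

```python
def convert_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from a legacy charset to UTF-8 (hypothetical helper).

    Decoding is strict on purpose: invalid input raises UnicodeDecodeError
    rather than being silently mangled during a migration.
    """
    return data.decode(source_encoding).encode("utf-8")


legacy = "日本語のテキスト".encode("euc_jp")  # stand-in for existing EUC-JP data
converted = convert_to_utf8(legacy, "euc_jp")
print(converted.decode("utf-8"))
```

A real tool would also need to stream large dumps and report the offending rows when a conversion fails, which this sketch leaves out.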

Hey, there is always the brute-force solution of storing textual data in the encoding of your choice as binary ;)

Toru Maesaka drizzle, oss