Google App Engine and its Memcache API
Google App Engine (GAE) is something I’ve been meaning to look into out of personal interest, but until now I kept putting it off due to laziness and being relatively busy.
Specifically, I’m interested in the Datastore API and the Memcache API since, well, that’s what I do. For those who aren’t familiar with GAE, it is a platform provided by Google that lets you run your web application on their infrastructure. You use the Google infrastructure through a set of provided APIs, and they take care of scaling and HA (high availability) issues for you. This means you don’t have to invest in hardware (the running cost is elastic) or repair anything (other than your code, of course). So, it’s a typical example of PaaS.
Taking a look at the Memcache API
Nowadays it’s gradually becoming common knowledge in the web industry that using memcached can help your site scale and reduce response time dramatically in a cost-efficient fashion (compare the cost of adding a DB slave with that of adding a memcached node). The question is, what’s behind Google’s Memcache API? The App Engine documentation only states that:
The Memcache API has similar features to and is compatible with memcached by Danga Interactive.
So, it’s actually not stated that the backend is powered by memcached, despite the name. This means the backend could be anything, like a distributed Google Sparse Hash over the wire. I guess what’s important is not so much the cache daemon itself: by keeping the interface consistent with memcached, developers who are familiar with memcached can use GAE without allergic reactions. Not to mention, memcached has a brilliant interface for a distributed cache.
Caching your data on GAE is uber simple. You first import the ‘memcache’ module from the GAE package:
from google.appengine.api import memcache
then call the appropriate API method for whatever it is that you want to do.
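As a rough sketch of what "calling the appropriate API method" tends to look like in practice, here is the usual read-through pattern: try the cache first and fall back to the slow source on a miss. The `get_or_load` helper, the `loader` callback and the `DictCache` stand-in are all made up for illustration. On App Engine you would pass the `memcache` module itself as `cache`, since its `get` returns `None` on a miss and its `set` accepts a `time` argument in seconds.

```python
class DictCache(object):
    """Hypothetical in-process stand-in exposing get/set like the memcache module."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Like memcache.get, return None when the key is missing.
        return self._data.get(key)

    def set(self, key, value, time=0):
        # The expiry (time) is accepted but ignored in this stand-in.
        self._data[key] = value
        return True


def get_or_load(cache, key, loader, ttl=60):
    """Return the cached value for key, calling loader() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = loader()                 # e.g. a Datastore query
        cache.set(key, value, time=ttl)
    return value


cache = DictCache()
calls = []


def slow_lookup():
    calls.append(1)                      # count how often the slow path runs
    return 'profile-data'


first = get_or_load(cache, 'user:42', slow_lookup)
second = get_or_load(cache, 'user:42', slow_lookup)
```

The second call is served from the cache, so `slow_lookup` only runs once. One caveat of this simple version: a legitimately cached `None` is indistinguishable from a miss.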
Just for fun, I tried setting a value using a key that’s longer than 250 bytes, since the maximum length of a key that memcached will accept over the ASCII protocol is 250 bytes (i.e. 250 ASCII characters). So how about App Engine?
from google.appengine.api import memcache

memcache.flush_all()
test_key = 'x' * 300
if not memcache.set(test_key, 'some_val'):
    print 'Failed to set'
    quit()
print "Looks like we're good = " + memcache.get(test_key)
Well, it turns out this code didn’t run. My local app server gave this error message instead:
Keys may not be more than 250 bytes in length, received 300 bytes
Hehe, this looks very memcached to me but who knows, this could also be deliberate to keep things consistent with memcached.
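If you do end up with keys longer than 250 bytes, one common workaround is to hash the long ones down to a fixed length before handing them to the cache. Below is a minimal sketch of that idea. The `safe_key` helper and the `sha1:` prefix are my own naming and not part of the Memcache API.

```python
import hashlib

MAX_KEY_BYTES = 250  # limit reported by the error message above


def safe_key(key):
    """Return key unchanged if it is short enough, else a fixed-length digest."""
    raw = key.encode('utf-8')
    if len(raw) <= MAX_KEY_BYTES:
        return key
    # A SHA-1 hex digest is always 40 characters, well under the limit.
    return 'sha1:' + hashlib.sha1(raw).hexdigest()


short = safe_key('user:42')
hashed = safe_key('x' * 300)
```

The trade-off is that hashed keys are no longer human readable, and a short key that happens to start with `sha1:` could in theory clash with a hashed one.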
Memcache API and Datastore API in Action
Okay, so to see if the Memcache API + Datastore API perform just like what you would expect from memcached + MySQL, I wrote a simple GAE web application. Here is the source code and screenshots of the application actually running on Google:
All it does is populate your cache and persistent storage with 64 rows that are 4KB each (so, 256KB in total) and measure how long it takes to bring them over to the application layer. This is obviously not enough to simulate data transfer in a real-world web application, but I figured it’s enough to make a point.
So as expected, retrieving data is faster using the Memcache API, and in theory this performance should not degrade but stay constant even with more concurrent connections and requests. On the other hand, performance of the Datastore API _could_ degrade. I’m saying “could” because as much as I’d like to prove this point, I didn’t really want to run ab (ApacheBench) against Google.
Btw, after quickly looking at the caching code in the SDK, it seems Memcache is emulated using a Python dictionary in the local development environment.
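For a feel of what such a measurement involves, here is a minimal timing sketch. It is not the application's actual source: a plain dictionary stands in for the cache or the Datastore, and `build_rows` and `timed_fetch` are names I made up for illustration.

```python
import time

ROWS = 64
ROW_SIZE = 4 * 1024  # 4KB per row


def build_rows():
    """Create 64 rows of 4KB each, keyed row:0 .. row:63."""
    return dict(('row:%d' % i, 'x' * ROW_SIZE) for i in range(ROWS))


def timed_fetch(fetch, keys):
    """Call fetch(key) for every key and return (rows, elapsed_seconds)."""
    start = time.time()
    rows = [fetch(key) for key in keys]
    return rows, time.time() - start


store = build_rows()                      # stand-in for cache or Datastore
rows, elapsed = timed_fetch(store.get, sorted(store))
total_bytes = sum(len(row) for row in rows)
print('%d bytes in %.6f seconds' % (total_bytes, elapsed))
```

On App Engine, `fetch` would be something like `memcache.get` on one side and a Datastore lookup on the other, with the same keys passed to both so the timings are comparable.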
Taking a look into Cached Bytes
Conveniently, the Memcache API provides a simple way to fetch the number of bytes currently being cached for you:
from google.appengine.api import memcache

stats = memcache.get_stats()
if stats:
    print stats['bytes']
Being a curious individual and a great stalker, I decided to use this information to compare whatever it is that’s behind the Memcache API with memcached. You see, with memcached you don’t get back the exact number of key/value bytes that you sent over the wire, because memcached reports the total number of bytes it has consumed, including the per-item overhead (as it should). In other words, what memcached reports is “unique”.
So, below is what I got from comparing the Memcache API (on Google’s infrastructure) with memcached 1.2.6, the latest release at the time of writing:
1 x 128 byte value with a 5 byte key
Memcache API: 133 bytes
memcached-1.2.6: 184 bytes
64 x 128 byte values with 5 byte keys
Memcache API: 8512 bytes
memcached-1.2.6: 11776 bytes
128 x 128 byte values with 5 byte keys
Memcache API: 17024 bytes
memcached-1.2.6: 23552 bytes
Wow, according to the above results, Google’s Memcache backend is not showing any overhead in its report. Maybe it is a sparse map over the wire after all. But like I mentioned earlier, it doesn’t really matter what’s behind the API, because what’s actually important is that it’s easy for us end-users to use and that lookups perform in O(1) time.
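To put a number on that overhead, the figures above can be divided by the item count. The short calculation below does just that. It assumes the first pair of figures is for a single item, which I am inferring from 128 + 5 = 133.

```python
# (item count, bytes reported by Memcache API, bytes reported by memcached 1.2.6)
measurements = [
    (1, 133, 184),        # item count inferred: 128 + 5 = 133
    (64, 8512, 11776),
    (128, 17024, 23552),
]

PAYLOAD = 128 + 5  # value bytes + key bytes per item

overheads = []
for count, api_bytes, memcached_bytes in measurements:
    api_per_item = api_bytes // count
    memcached_per_item = memcached_bytes // count
    # Overhead is whatever is reported beyond the raw key + value bytes.
    overheads.append((api_per_item - PAYLOAD, memcached_per_item - PAYLOAD))

print(overheads)  # [(0, 51), (0, 51), (0, 51)]
```

So in every run the Memcache API reports exactly key + value bytes, while memcached 1.2.6 reports a constant 51 extra bytes per item.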
Conclusion
The Google App Engine documentation rocks! Like I mentioned on Twitter, the team that worked on the documentation should get a medal. It got me started in no time and gave me just enough information to start doing my own thing without getting frustrated by excessive information.
There are still unresolved questions, like how sharding works for the Memcache API. I mean, does each application get dedicated server instance(s), or are keys prefixed or suffixed with an app_id in the background? The latter approach sounds simple and effective, but it opens up another question about stats management. I guess a housekeeping index for each application would get around this issue, but there is no programmatic way to confirm this from the outside.
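To make the key-prefixing idea concrete, here is a purely hypothetical sketch of a shared cache that isolates applications by prepending an app_id and keeps a per-application byte count on the side. This is a guess at one possible design, not a description of how Google actually does it. The `SharedCache` class and its methods are invented for illustration, and it assumes app_ids never contain a colon.

```python
class SharedCache(object):
    """Hypothetical shared cache that isolates apps by prefixing keys with app_id."""

    def __init__(self):
        self._data = {}    # one flat key space shared by every app
        self._bytes = {}   # housekeeping index: app_id -> cached bytes

    def _full_key(self, app_id, key):
        # Assumes app_id contains no ':' so prefixes cannot be ambiguous.
        return '%s:%s' % (app_id, key)

    def set(self, app_id, key, value):
        full_key = self._full_key(app_id, key)
        old = self._data.get(full_key)
        if old is not None:
            # Overwrite: drop the old item's size before adding the new one.
            self._bytes[app_id] -= len(key) + len(old)
        self._data[full_key] = value
        self._bytes[app_id] = self._bytes.get(app_id, 0) + len(key) + len(value)
        return True

    def get(self, app_id, key):
        return self._data.get(self._full_key(app_id, key))

    def get_stats(self, app_id):
        # Only the caller's own bytes are reported, without the prefix.
        return {'bytes': self._bytes.get(app_id, 0)}


shared = SharedCache()
shared.set('app_a', 'greet', 'hello')
shared.set('app_b', 'greet', 'bonjour!')
```

Both applications use the key `greet` without stepping on each other, and each one sees only its own byte count, which would also match the overhead-free numbers reported earlier.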
On a different note, I should stop being a stalker and just enjoy what’s been provided (though this is a really difficult thing to do once you dive into the world of engineering) :)


