Notes on Loading Data to Google App Engine

June 15th, 2010

Google has fantastic documentation on this topic, but at the time I wrote this blog entry it covered how to download and upload data using appcfg.py, not bulkloader.py (there is also bulkload_client.py). So, I decided to play around with the nifty bulkloader and keep notes on my findings.

Prepare the End Point for Loading Data

Loading data to the Data Store is accomplished by sending data to the application over HTTP. This means that your application needs a uniquely identifiable URI for you to send your data to. Creating a valid URI is just a matter of setting up a handler for it in the app.yaml config file. GAE takes care of the import logic with its own handler. There’s nothing special in this step, and the documentation covers how to do it concisely.

Test Data for Demo Purpose

For this blog entry, I decided to prepare a CSV with four rows that represent users. In reality, there would be more information related to a user, but I decided to keep things minimal for this blog entry. I saved this data as user.csv.

1, Daniel, Bernstein, xxxxxxx
2, Donald, Knuth, xxxxxxx
3, Bjarne, Stroustrup, xxxxxxx
4, Robert, Sedgewick, xxxxxxx

You can also represent your table in XML, but I decided to use CSV for its simplicity.

Create a Bulk Loader Configuration File or Not

In addition to the CSV file, the bulk loader needs to know how each record in the CSV file should be represented as a Data Store entity. The modeling as far as I know can be done in two ways. One is to write a loader class in Python that the bulkloader can use. Another approach is to get bulkloader.py to generate a configuration file (in YAML).

I decided to write my own Python class to get through this step since, according to the documentation at the time this blog post was written, the latter approach (the generated configuration file) doesn’t work with the local development server.

With the above in mind, here is my loader class. You would usually keep the Data Model definition (the User class) in a separate file but for demo purposes, I decided to keep it in one file.

from google.appengine.ext import db
from google.appengine.tools import bulkloader

# Data model: one Datastore entity of kind 'User' per CSV row.
class User(db.Model):
  id = db.IntegerProperty()
  firstname = db.StringProperty()
  lastname = db.StringProperty()
  some_text = db.StringProperty()

# Loader: maps each CSV column, in order, to a property name and a converter.
class UserLoader(bulkloader.Loader):
  def __init__(self):
    bulkloader.Loader.__init__(self, 'User',
                               [('id', int),
                                ('firstname', str),
                                ('lastname', str),
                                ('some_text', str)])

# bulkloader.py looks for this module-level list of loader classes.
loaders = [UserLoader]

What this class does is explained in the documentation. I saved this script as user_loader.py.
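To make the mapping concrete, here is a minimal standalone sketch of what the (property, converter) list amounts to: each CSV column is passed through the converter in the same position and stored under the matching property name. It does not use the App Engine SDK and is not how bulkloader.py is implemented; row_to_properties is a hypothetical helper, and skipinitialspace=True is an assumption made so that the space after each comma in user.csv is dropped.

```python
import csv
import io

# Same (property name, converter) pairs that UserLoader passes to bulkloader.Loader.
PROPERTIES = [('id', int),
              ('firstname', str),
              ('lastname', str),
              ('some_text', str)]

# Inline copy of user.csv so the sketch is self-contained.
USER_CSV = """1, Daniel, Bernstein, xxxxxxx
2, Donald, Knuth, xxxxxxx
3, Bjarne, Stroustrup, xxxxxxx
4, Robert, Sedgewick, xxxxxxx
"""

def row_to_properties(row):
    # Hypothetical helper: apply each converter to the column in the same position.
    return {name: convert(value) for (name, convert), value in zip(PROPERTIES, row)}

# skipinitialspace=True is an assumption: it strips the space after each comma.
reader = csv.reader(io.StringIO(USER_CSV), skipinitialspace=True)
users = [row_to_properties(row) for row in reader]

for user in users:
    print(user)
```

Without skipinitialspace, the string columns would keep their leading space, which is worth keeping in mind when preparing real CSV files.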

Load your Data to the Data Store

For demo purposes, I used my local development server on port 8083 to load the CSV file. Given that the application is running and that the API endpoint is active, it’s just a matter of providing bulkloader.py with the essential information. For the available options, I recommend reading the help by running ‘bulkloader.py -h’.

The following command attempts to load our entity of ‘kind=User’ from user.csv using our loader class (user_loader.py) to the endpoint.

$ bulkloader.py --filename=user.csv --config_file=user_loader.py \
--kind=User --url=http://localhost:8083/import --app_id=your_app_id

Note that it’s essential to provide the --app_id option when uploading data to the local server. When asked for credentials, you can type anything you like. You only need to supply valid credentials when uploading to production.

Here’s the output from executing the above command.

[INFO    ] Logging to bulkloader-log-20100615.213842
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20100615.213842.sql3
Please enter login credentials for localhost
Email: foo
Password for foo: 
[INFO    ] Connecting to localhost:8083/import
[INFO    ] Starting import; maximum 10 entities per post
[INFO    ] 4 entites total, 0 previously transferred
[INFO    ] 4 entities (933 bytes) transferred in 4.0 seconds
[INFO    ] All entities successfully transferred

Success!

Toru Maesaka knowledge, technology

Google App Engine and its Memcache API

November 24th, 2008

Google App Engine (GAE) is something I’ve been meaning to look into out of personal interest, but I kept putting it off until now due to laziness and being relatively busy.

So specifically, I’m interested in the Datastore API and the Memcache API since, well, that’s what I do. For those who aren’t familiar with GAE, it is a platform provided by Google that allows you to run your web application on their infrastructure. You use the Google infrastructure through a set of provided APIs, and they take care of scaling and HA issues for you. This means you don’t have to invest in hardware (elastic running cost) or repair anything (other than your code, of course). So, it’s a typical example of PaaS.

Taking a look at the Memcache API

Nowadays it’s gradually becoming common knowledge in the web industry that using memcached can help your site scale and reduce response time dramatically in a cost-efficient fashion (adding a DB slave vs. a memcached node). The question is, what’s behind Google’s Memcache API? The App Engine documentation only states that:

The Memcache API has similar features to and is compatible with memcached by Danga Interactive.

So, it’s actually not stated that the backend is powered by memcached, despite the name. This means that the backend could be anything, like a distributed Google Sparse Hash over the wire. I guess what’s important is not so much the cache daemon itself: by keeping the interface consistent with memcached, developers who are familiar with memcached can use GAE without allergic reactions. Not to mention, memcached has a brilliant interface for a distributed cache.

Caching your data on GAE is uber simple. You first import the ‘memcache’ module from the GAE package:

from google.appengine.api import memcache

then call the appropriate API method for whatever it is that you want to do.

Just for fun, I tried setting a value using a key that’s longer than 250 bytes, since the maximum length of a key that memcached will accept over the ASCII protocol is 250 bytes (aka 250 ASCII characters). So how about App Engine?

from google.appengine.api import memcache
 
memcache.flush_all()
test_key = 'x' * 300
 
if not memcache.set(test_key, 'some_val'):
    print 'Failed to set'
    quit()
 
print "Looks like we're good = " + memcache.get(test_key)

Well, it turns out this code failed with the following error message from my local app server:

Keys may not be more than 250 bytes in length, received 300 bytes

Hehe, this looks very memcached to me but who knows, this could also be deliberate to keep things consistent with memcached.
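If your keys can exceed this limit, one common workaround is to hash overlong keys before using them. The sketch below is standalone Python and not part of the Memcache API; safe_key is a hypothetical helper, and the choice of SHA-256 and the ‘sha256:’ prefix are assumptions.

```python
import hashlib

MAX_KEY_BYTES = 250  # limit reported by the error message above

def safe_key(key):
    # Hypothetical helper: return the key unchanged if it fits,
    # otherwise replace it with a fixed-length SHA-256 hex digest.
    raw = key.encode('utf-8')
    if len(raw) <= MAX_KEY_BYTES:
        return key
    return 'sha256:' + hashlib.sha256(raw).hexdigest()

short = safe_key('user:42')
hashed = safe_key('x' * 300)
print(short)
print(len(hashed))
```

The trade-off is that a hashed key is no longer human-readable, so it helps to keep the original key somewhere if you need to debug cache contents.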

Memcache API and Datastore API in Action

Okay, so to see if the Memcache API + Datastore API performs just like what you would expect from memcached + MySQL, I wrote a simple GAE web application. Here are the source code and screenshots of the application actually running on Google:

[Screenshots: gae_memcache_api, gae_datastore_api]

All it does is populate your cache and persistent storage with 64 rows that are 4KB each (so, 256KB in total) and measure how long it takes to bring them over to the application layer. This is obviously not enough to simulate data transfer in a real-world web application, but I figured it’s enough to make a point.
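The measurement itself boils down to something like the following sketch. It is not the application’s actual source code: a plain dict stands in for both backends, and populate and time_fetch are hypothetical names, so only the shape of the test (64 rows of 4KB, timed retrieval) is taken from the description above.

```python
import time

ROWS = 64
ROW_SIZE = 4 * 1024  # 4KB per row

def populate(store):
    # Fill the stand-in store with 64 values of 4KB each.
    for i in range(ROWS):
        store['row%d' % i] = 'x' * ROW_SIZE

def time_fetch(store):
    # Fetch every row and return (bytes fetched, elapsed seconds).
    start = time.perf_counter()
    fetched = sum(len(store['row%d' % i]) for i in range(ROWS))
    return fetched, time.perf_counter() - start

store = {}  # stand-in for the cache or the datastore
populate(store)
fetched_bytes, elapsed = time_fetch(store)
print('%d bytes in %.6f seconds' % (fetched_bytes, elapsed))
```

In the real application, the dict lookups would be replaced by calls to the Memcache API and the Datastore API respectively, with the same timing wrapped around each.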

So as expected, retrieving data is faster with the Memcache API, and in theory this performance should stay constant rather than degrade, even with more concurrent connections and requests. On the other hand, performance of the Datastore API _could_ degrade. I’m saying “could” because as much as I’d like to prove this point, I didn’t really want to run ab (ApacheBench) against Google.

Btw, after quickly looking at the caching code in the SDK, it seems Memcache is emulated using a Python dictionary in the local development environment.
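A dictionary-backed emulation can be very small. The class below is a guess at the general idea, not the SDK’s code: DictCache and its methods are hypothetical, and the key-length check simply mirrors the error message shown earlier.

```python
class DictCache(object):
    # Hypothetical in-process stand-in for a cache service.
    MAX_KEY_BYTES = 250

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Reject overlong keys, then store the value in the dict.
        # len() counts characters, which equals bytes for ASCII keys.
        if len(key) > self.MAX_KEY_BYTES:
            raise ValueError('Keys may not be more than %d bytes in length, '
                             'received %d bytes' % (self.MAX_KEY_BYTES, len(key)))
        self._data[key] = value
        return True

    def get(self, key):
        # Return the cached value, or None on a miss.
        return self._data.get(key)

    def flush_all(self):
        # Drop everything that has been cached so far.
        self._data.clear()
        return True

cache = DictCache()
cache.set('greeting', 'hello')
print(cache.get('greeting'))
print(cache.get('missing'))
```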

Taking a look into Cached Bytes

Conveniently, the Memcache API provides a simple way to fetch the number of bytes currently being cached for you:

from google.appengine.api import memcache
 
stats = memcache.get_stats()
if stats: print stats['bytes']

Being a curious individual and a great stalker, I decided to use this information to compare whatever it is that’s behind the Memcache API with memcached. You see, with memcached you don’t get the exact number of key/value bytes that you sent over the wire, because memcached reports the total number of bytes it has consumed, including the per-item overhead (as it should). In other words, the figure memcached reports is specific to its implementation.

So, below is what I got from comparing the Memcache API (on Google’s infrastructure) and the latest release of memcached (1.2.6) at the point of this blog entry:

1 x 128 byte value with a 5 byte key
Memcache API: 133 bytes
memcached-1.2.6: 184 bytes

64 x 128 byte values with 5 byte keys
Memcache API: 8512 bytes
memcached-1.2.6: 11776 bytes

128 x 128 byte values with 5 byte keys
Memcache API: 17024 bytes
memcached-1.2.6: 23552 bytes

Wow, according to the above results, Google’s Memcache backend is not showing any overhead in its report. Maybe it is a sparse map over the wire after all. But like I mentioned earlier, it doesn’t really matter what’s behind the API because what’s actually important is that it’s easy for us end-users to use and that it performs in an O(1) manner.
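The arithmetic behind that observation can be checked directly. The figures in the sketch below are copied from the results above; only the variable names are made up.

```python
KEY_BYTES = 5
VALUE_BYTES = 128

# (item count, Memcache API bytes, memcached-1.2.6 bytes) from the results above
RESULTS = [(1, 133, 184),
           (64, 8512, 11776),
           (128, 17024, 23552)]

overheads = []
for count, api_bytes, memcached_bytes in RESULTS:
    payload = count * (KEY_BYTES + VALUE_BYTES)  # raw key + value bytes
    api_overhead = (api_bytes - payload) // count
    memcached_overhead = (memcached_bytes - payload) // count
    overheads.append((api_overhead, memcached_overhead))
    print('%3d items: API +%d bytes/item, memcached +%d bytes/item'
          % (count, api_overhead, memcached_overhead))
```

In every case the Memcache API figure equals the raw key plus value size, while memcached reports a constant 51 bytes more per item.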

Conclusion

The Google App Engine Documentation rocks! Like I mentioned on Twitter, the team that worked on the documentation should get a medal. It got me started in no time and gave me just enough information to start doing my own thing without getting frustrated by excessive information.

There are still unresolved questions, like how sharding works for the Memcache API. I mean, does each application get dedicated server instance(s), or are keys appended/prepended with an app_id in the background? The latter approach sounds simple and effective, but it opens up another question of stats management. I guess a housekeeping index for each application would get around this issue, but there is no programmatic way to confirm this from the outside.
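To illustrate the second idea, here is a speculative sketch of prefixing keys with an app_id on a shared store while keeping a per-application byte count on the side. Nothing here reflects how Google actually does it; SharedCache, the ‘app_id:key’ format, and the stats dictionary are all assumptions.

```python
from collections import defaultdict

class SharedCache(object):
    # Hypothetical shared store: keys are namespaced per application.
    def __init__(self):
        self._data = {}
        self._bytes = defaultdict(int)  # housekeeping index: app_id -> bytes

    def _full_key(self, app_id, key):
        return '%s:%s' % (app_id, key)

    def set(self, app_id, key, value):
        full_key = self._full_key(app_id, key)
        old = self._data.get(full_key)
        if old is not None:
            # Replacing a value: drop the old size from the app's total.
            self._bytes[app_id] -= len(key) + len(old)
        self._data[full_key] = value
        self._bytes[app_id] += len(key) + len(value)

    def get(self, app_id, key):
        return self._data.get(self._full_key(app_id, key))

    def get_stats(self, app_id):
        # Report only the bytes that belong to this application.
        return {'bytes': self._bytes[app_id]}

shared = SharedCache()
shared.set('app_a', 'k0001', 'v' * 128)
shared.set('app_b', 'k0001', 'w' * 16)
print(shared.get_stats('app_a'))
print(shared.get_stats('app_b'))
```

Both applications use the same key here without colliding, and each sees only its own byte count, which is the behaviour the housekeeping index is meant to provide.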

On a different note, I should stop being a stalker and just enjoy what’s been provided (though this is a really difficult thing to do once you dive into the world of engineering) :)

Toru Maesaka memcached