Home > drizzle, knowledge, oss > Storage Engine Dev Journal #3 : Supporting variable width tables

Storage Engine Dev Journal #3 : Supporting variable width tables

June 16th, 2009

Something I’ve added to BlitzDB recently that was pretty high on my todo list is support for variable width tables. So what is a variable width table? it is a table that contains columns that can vary in size, namely BLOB and TEXT types.

Going back to the basics, when a new row is to be written, a storage engine is given a pointer to the row data in MySQL format that it must somehow store for later lookup/retrieval. By meaning “somehow”, the storage engine is given the freedom to do whatever it likes with the row.

Writing a row for a fixed length table (a table with columns that are always the same size) is deadly easy. A storage engine can choose to not temper with the row and simply write or copy the data to it’s storage mechanism. This is because the storage engine is given a row that contains all the data. Rows for variable width tables however, are treated differently since things aren’t as simple (it’s variable!).

The difference is that columns for BLOB and TEXT types are represented by two parts inside a MySQL/Drizzle row:

  • length of the data
  • pointer to the actual data

This is simple to understand since we need to know the size of the data to copy it.

Minor Complication

The minor complication as you would expect here is that you can’t directly write the provided row to your engine like you can with fixed length tables. The data that you want to copy/write exists elsewhere (hence the pointer) so directly writing the row has no meaning (the data would have disappeared by your next access to that row). You need to make sure that the actual data for BLOB/TEXT column(s) are arranged appropriately on your engine’s row buffer and written out to it’s storage mechanism.

This process is commonly referred to as row packing (converting to your engine format) and unpacking (convert back to MySQL format). So how is this done? it’s actually pretty simple!

The solution is actually simple

As much as it sounds like a bother to support variable length rows, it’s actually not that bad. First you need to understand what a MySQL row looks like internally.

A MySQL row begins with a bitset that represents which fields are NULL. The length of this data obviously depends on the number of NULLable columns you have but this is easy to handle with Drizzle since we’re given all the relevant information by the TableShare object (same goes for MySQL from a different object).

After this data comes the actual column data in the order that appears in your CREATE TABLE statement. What you need to do to get packing working with this row is the not-so-obvious part that you really need an example to look at. Fortunately Tweeting about this attracted Brian’s attention which helped me move forward.

Loop the fields!

So, let’s take row insertion to a variable width table as an example. Imagine this table:

CREATE TABLE t1 (
  id int PRIMARY KEY NOT NULL,
  description text,
  arbitrary_data blob
) engine=your_engine;

and let’s imagine that we need to process this query:

INSERT INTO t1 VALUES (1, "hello world", "blobbbbb");

Now, the storage engine needs to “pack” the data for each column into it’s buffer in the write_row() function. Conveniently, Drizzle/MySQL provides a pack() function for it’s column types (fields) that will do the data packing for you. That is, you do not have to inspect the provided row for pointers to the actual data and do the packing/copying yourself.

How? well, the table object (which is visible from your engine) conveniently holds a list of fields in the appropriate order. The actual pack() function is a member of these fields so you just need to call it as you loop over the list:

/* make sure row_buffer has enough memory */
unsigned char *pos = row_buffer;
 
/* copy NULL bits, "table->s" is the TableShare object */
memcpy(pos, row, table->s->null_bytes);
pos += table->s->null_bytes;
 
/* "row" is the MySQL formatted row given by the core */
for (Field **field = table->field; *field; field++) {
  if (!((*field)->is_null()))
    pos = (*field)->pack(pos, row + (*field)->offset(row));
}

The above code snippet will populate “row_buffer” with the actual data that you want to write to your storage mechanism. You do not have to forward the “pos” pointer because pack() returns a pointer at the end of where it had worked in the buffer (think Pascal Strings). This is precisely why we created the pos pointer, to avoid row_buffer from being forwarded.

For the opposite situation (when retrieving a row), an unpack() function is provided for each field so you just need to take advantage of it like we did with the pack() snippet above.

Little bit more on fields

The actual pack() function that gets called depends on the type of column since the Field class is an abstract base class for the sub classes that actually represents column types inside Drizzle/MySQL. If you want to know what a pack() function looks like for a BLOB type, grep for “Field_blob” in the source tree and there will be a pack() member function for it.

The code layout for field subsystem in MySQL is rather difficult to comprehend since everything is crammed in “sql/field.c” and “sql/field.h” files (at least as of 5.4). So, if you want to get a good grasp of how things are architectured, you should take a look at Drizzle. Field subclasses are located individually in the “drizzled/field/” directory and the base class is located in “drizzled/field.h”.

So, that’s about it! Hopefully this information will help other engine developers when they come across a need to support variable width tables :)

Toru Maesaka drizzle, knowledge, oss , , ,

  1. No comments yet.
  1. No trackbacks yet.