Binary data version migration

By Macoy Madson.

I have been working on a voxel-based game for the last several months. As part of its development, I created a binary versioning and migration system that I haven't seen done elsewhere.

Game data goals

Arguably the most important aspect of a game's technical architecture is its data model. The choice of the data model has wide-ranging impacts on iteration speed, development time, game design, player experience, performance, and more.

I had several goals in mind when deciding on this game's data model:

The plain old data model

The game's data model is highly unusual for one made in 2023:

There are several advantages to this approach:

There are of course disadvantages:

Versioning plain old data

Data models can be simple right up until you need to reckon with changes to the data model's "schema" over time.

There are several C++ projects that handle structure-to-binary serialization and structure versioning (list not exhaustive):

I'm not interested in hearing about more of these, because I'm only using Cakelisp and some third-party libraries written exclusively in C. In addition, these all fail one of my main constraints: the serialization "schema" must not be separate from the structure declaration.

Luckily, Cakelisp is designed to implement these sorts of things through full-power macro execution at compile time. Similar to my introspection system, binary versioning is implemented through a collection of macros and compile-time code generation phases.

I cooked up my own binary versioning scheme. Here's the game's world-state-data:

(def-versioned-struct world-state-data (version 7)
  space voxel-space (live (1 1 .))
  chunks (array 1024 voxel-chunk) (live (1 1 .))
  players (array 16 player-data) (live (1 1 .))
  accounts (array 16 player-account) (live (1 1 .))
  characters (array 1024 character) (live (1 4 .))
  character-profiles (array 1024 character-profile) (live (1 4 .))
  doors (array 1024 door-data) (live (1 2 2) (2 3 4) (3 5 5) (4 6 .))
  rooms (array 1024 room-data) (live (1 7 .))
  world-time world-time-data (live (1 1 .)))

This looks the same as a regular struct in Cakelisp, only with an additional expression after each field's type. This expression is the field's "versionings".

To explain, let's take one field: chunks (array 1024 voxel-chunk) (live (1 1 .)). Here, chunks is the field's name, (array 1024 voxel-chunk) is its type (an array of 1024 voxel-chunk structs), and (live (1 1 .)) is its versionings, meaning the field has held voxel-chunk at schema version 1 since world-state-data version 1 and is still live (the .).

The versioning format is live or dead followed by the field's version history: for struct fields, a list of entries of the form (sub-struct-version start-version end-version); for plain-old-data fields, simply a range start-version end-version.

The doors field shows more complex history. (live (1 2 2) (2 3 4) (3 5 5) (4 6 .)) indicates that the door-data schema at version 1 was valid in the world-state-data at version 2, but door-data at version 2 was used instead in world-state-data versions 3 through 4 inclusive. You should be able to figure out the rest of the versionings from there.

Within door-data it is clear why its version has changed over time:

(def-versioned-struct door-data (version 4)
  dead-type uint32_t (dead 3 3)
  dead-position fixed-vec3 (dead 1 3)
  position voxel-position (live (1 4 .))
  type door-type (live 4)
  orientation cardinal-orientation (live 2)
  is-open bool (live 1))

In version 4 I removed dead-type and dead-position, added position, and added type. I actually changed the type of position from fixed-vec3 to voxel-position, but the easiest way to do this was to create a new field and rename the old one.

Virtual types

By including this version information with each field, we can construct the type at any version in its history. I call these "virtual types", because all C knows about is the "real" type used at runtime, i.e. the struct containing only live fields.

Here's door-data expanded into all of its virtual types:

(def-versioned-struct door-data (version 1)
  position fixed-vec3
  is-open bool)

(def-versioned-struct door-data (version 2)
  position fixed-vec3
  orientation cardinal-orientation
  is-open bool)

(def-versioned-struct door-data (version 3)
  type uint32_t
  position fixed-vec3
  orientation cardinal-orientation
  is-open bool)

(def-versioned-struct door-data (version 4)
  position voxel-position
  type door-type
  orientation cardinal-orientation
  is-open bool)

These definitions do not exist as far as the compiler is concerned, except for the latest one (in this case, version 4). Instead, when data is detected in an older version (via its version header), the virtual type is constructed by reading the version log and adding, one by one in declaration order, the fields that were live in that version.

Constructing virtual types is possible by following the C compiler's struct padding and alignment rules. As part of validation, code is generated to check at runtime where the compiler says fields are versus where the versionings think they are, in order to catch any differences between platforms or compilers.
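To make the padding rules concrete, here's a minimal sketch of how field offsets can be computed from each field's alignment and size. This is illustrative C with made-up names, not the generated code, and it assumes power-of-two alignments (which holds for C types on the targets discussed here):

// Illustrative sketch only: compute where each field of a virtual type lands,
// using the same padding rules the C compiler applies.
#include <stddef.h>

typedef struct VirtualFieldLayout
{
    size_t alignment; // alignment of the field's type
    size_t size;      // total size of the field (element size times array count)
} VirtualFieldLayout;

static size_t alignUp(size_t offset, size_t alignment)
{
    return (offset + (alignment - 1)) & ~(alignment - 1);
}

// Computes each field's offset and returns the total size of the virtual struct
static size_t layoutVirtualStruct(const VirtualFieldLayout* fields, size_t numFields,
                                  size_t* fieldOffsetsOut)
{
    size_t offset = 0;
    size_t structAlignment = 1;
    for (size_t i = 0; i < numFields; ++i)
    {
        offset = alignUp(offset, fields[i].alignment);
        fieldOffsetsOut[i] = offset;
        offset += fields[i].size;
        if (fields[i].alignment > structAlignment)
            structAlignment = fields[i].alignment;
    }
    // Tail padding so that arrays of the struct keep every element aligned
    return alignUp(offset, structAlignment);
}

For the runtime version, the generated validation can compare offsets computed this way against the compiler's offsetof() results (and the total against sizeof()) and flag any mismatch.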

Migration

The first use case this system was applied to was the game's on-disk save files. I needed to be able to add and remove fields during development without invalidating my previous save files. This becomes critical once the game goes live, because any invalidation after that point would destroy players' save files. I know I'll need to modify the game after launch, so I might as well bite the bullet and get this working early.

The save file format is simple: a magic number identifying a Cakelisp versioned binary file, the data's version number, and then the raw bytes of the struct itself.

This makes saving the game state a matter of opening the file, writing the magic number and version, and writing the struct directly to disk with fwrite.

It can hardly get simpler than this, and the only gains I can see for speed would be to use newer I/O interfaces rather than a blocking fwrite.
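Concretely, the whole write path amounts to something like the following sketch. The names, signature, and magic value are placeholders of my own, not the actual generated code:

// Rough sketch of the write path (illustrative C only).
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define VERSIONED_FILE_MAGIC 0xCA4E0001u // placeholder magic number

static bool saveVersionedBlob(const char* path, uint32_t version,
                              const void* data, size_t size)
{
    FILE* file = fopen(path, "wb");
    if (!file)
        return false;
    uint32_t magic = VERSIONED_FILE_MAGIC;
    // Header: magic number then version, followed by the raw plain-old-data struct
    bool succeeded = fwrite(&magic, sizeof(magic), 1, file) == 1
                     && fwrite(&version, sizeof(version), 1, file) == 1
                     && fwrite(data, size, 1, file) == 1;
    fclose(file);
    return succeeded;
}

// Usage: saveVersionedBlob("world.bin", 7, &worldState, sizeof(worldState));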

Loading is a bit more complex due to data migration. First, the magic number is checked to ensure this is a Cakelisp versioned binary file. Next, the version number is checked. If the version matches the runtime version, the data is simply fread into memory. This is the optimal loading case.

If the version does not match, migration starts. If the data version is newer than the runtime version, the file must have been created by a newer build and cannot be read. If the data version is older, we lazily construct the virtual type at that version, then virtually construct each subsequent version, migrating field by field until we reach the runtime version. Two buffers are needed so we can copy from one version into the next, ping-ponging between them until we arrive at the target version.
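Here's a rough sketch of that control flow. Again, this is illustrative C with placeholder names, not the generated code; the migration helper is only declared to show where the virtual-type work happens:

// Sketch of the load path described above. VERSIONED_FILE_MAGIC matches the
// placeholder from the save sketch.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define VERSIONED_FILE_MAGIC 0xCA4E0001u // placeholder magic number

// Assumed helper: builds virtual types from the versionings and copies field by
// field, ping-ponging between two scratch buffers, until reaching runtimeVersion.
bool migrateToRuntimeVersion(FILE* file, uint32_t diskVersion,
                             uint32_t runtimeVersion, void* outData, size_t outSize);

typedef enum LoadResult
{
    LoadResult_Success,
    LoadResult_BadMagic,
    LoadResult_NewerThanRuntime,
    LoadResult_IoOrMigrationError,
} LoadResult;

LoadResult loadVersionedBlob(FILE* file, uint32_t runtimeVersion,
                             void* outData, size_t outSize)
{
    uint32_t magic = 0;
    uint32_t diskVersion = 0;
    if (fread(&magic, sizeof(magic), 1, file) != 1 || magic != VERSIONED_FILE_MAGIC)
        return LoadResult_BadMagic;
    if (fread(&diskVersion, sizeof(diskVersion), 1, file) != 1)
        return LoadResult_IoOrMigrationError;

    if (diskVersion == runtimeVersion)
    {
        // Optimal case: layouts match exactly, so read straight into the runtime struct
        return fread(outData, outSize, 1, file) == 1 ? LoadResult_Success
                                                     : LoadResult_IoOrMigrationError;
    }
    if (diskVersion > runtimeVersion)
        return LoadResult_NewerThanRuntime; // written by a newer build; refuse to read

    // Older data: migrate version by version up to the runtime version
    return migrateToRuntimeVersion(file, diskVersion, runtimeVersion, outData, outSize)
               ? LoadResult_Success
               : LoadResult_IoOrMigrationError;
}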

There are four field changes that can occur:

Custom migration handlers give the programmer the ability to change types, reorganize data, etc. as desired without losing data. Because they can be arbitrarily complex, they can complete any migration necessary, and could even prompt the user if they need to be involved.
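Purely as an illustration, a hand-written handler for the door position change might look something like the following. The signature, the registration mechanism, and the layouts of fixed-vec3 and voxel-position are all assumptions I'm making up here; the real system's handlers aren't shown in this article:

// Hypothetical custom migration handler; the signature and type layouts
// below are assumptions for illustration only.
#include <stddef.h>
#include <stdint.h>

typedef struct FixedVec3 { int32_t x, y, z; } FixedVec3;          // assumed 16.16 fixed point
typedef struct VoxelPosition { int32_t x, y, z; } VoxelPosition;  // assumed integer voxel coords

// Invoked while migrating door-data across the version that replaced
// dead-position (fixed-vec3) with position (voxel-position).
static void migrateDoorPosition(const void* oldDoorData, size_t oldPositionOffset,
                                void* newDoorData, size_t newPositionOffset)
{
    const FixedVec3* oldPosition =
        (const FixedVec3*)((const char*)oldDoorData + oldPositionOffset);
    VoxelPosition* newPosition =
        (VoxelPosition*)((char*)newDoorData + newPositionOffset);
    // Drop the fractional bits to land on whole voxel coordinates
    newPosition->x = oldPosition->x >> 16;
    newPosition->y = oldPosition->y >> 16;
    newPosition->z = oldPosition->z >> 16;
}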

Limitations

There are some constraints with using this system:

There are of course assumptions made that allow this setup to work:

These, I feel, are safe assumptions when targeting AMD64 architecture machines. I figure I can worry about compatibility with other architectures when I actually need to ship on one.

Because the schema is part of the code, endianness and even size conversions can happen automatically at the migration stage. This means even if I do end up having to break these assumptions later, I will have enough information to convert existing data.
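For example, since each versioning entry records the element size, a migration pass could byte-swap basic (non-struct) fields wholesale when the stored data's endianness differs from the host's. A minimal sketch of that swap:

// Sketch: swap the bytes of a single element in place, given its recorded size.
// Sub-struct fields would recurse into their own versionings instead.
#include <stddef.h>

static void swapBytesInPlace(unsigned char* element, size_t elementSize)
{
    for (size_t i = 0; i < elementSize / 2; ++i)
    {
        unsigned char temp = element[i];
        element[i] = element[elementSize - 1 - i];
        element[elementSize - 1 - i] = temp;
    }
}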

Credit where credit is due

My friend Eric Alzheimer originally came up with the idea to include the live and dead log in-line and to use it for automated data migration. This insight came from his reading of database research. In his own words (personal correspondence):

So, to give full credit:

The paper that I got most of the idea from is "A Model for Schema Versioning in Temporal Database Systems" (Roddick, John F, 2000), which introduces the idea of a database that stores a "Completed Schema" which is the temporal union of all schemas the database has ever had. In the database system, schemas are always updated via queries, allowing the database to keep track of liveness intervals, etc, internally. Since I had text instead, I added the live/dead syntax. That was based on some stuff Jon Blow brainstormed for JAI. See the bottom of Jai Primer.

The referenced code example from Jai Primer:

Entity_Substance :: struct { @version 37
    flags: int;
    scale_color: Vector4;   @15
    spike_flags: int;       @[10, 36]
}

Eric continues:

Part of why I chose the solution is because the completed schema approach + using runtime reflection for everything meant that the code generation overhead of the versioning system ends up being just a single additional struct definition per struct, and no actual code is generated at all.

I actually built a previous struct versioning system at my prior job and for that I had it generate C functions for everything, but C++ has so many potential compilation time foot-guns I try to avoid generating too much code.

You made the right choice keeping everything as flat structs so you can just fwrite(). [A data serialization system we collaborated on] would be that way but for Vector and Map etc, which I would personally rather replace with static sized alternatives; but I worry people will get annoyed by their absence.

Eric's idea for migration handlers was to create a super structure which is the union of all fields in all versions of the structure, then pass that to the hand-written handler functions. I decided instead to virtualize the structure by reading the versionings at runtime, because it was simpler to write the code generator for and generates less code. Instead of generating the super structure, I generate an array of versionings with the following data:

(defstruct field-version-data
  field-name-crc32 uint32_t
  core-type-crc32 uint32_t ;; struct-name-crc32 for substruct types
  core-type-version (unsigned short) ;; 0 for basic types
  start-version (unsigned short)
  end-version (unsigned short) ;; 0 for live fields
  ;; When migrating, we are working with completely virtualized types built implicitly through
  ;; version history, which means we won't know the offset until we compute it based on previous
  ;; fields live in the current version, so we need the alignment to properly determine field
  ;; placement in memory.
  alignment (unsigned short)
  element-size (unsigned int)
  array-count (unsigned int))

Here's door-data, for example:

FieldVersionData doorDataVersionings[] = {
    {1440875332, 0, 0, 3, 3, __alignof__(uint32_t), sizeof(uint32_t), 0},
    {1747541704, 0, 0, 1, 3, __alignof__(FixedVec3), sizeof(FixedVec3), 0},
    {1177347317, 1916533177, 1, 4, 0, -1, -1, 0},
    {2363381545, 0, 0, 4, 0, __alignof__(DoorType), sizeof(DoorType), 0},
    {914408790, 0, 0, 2, 0, __alignof__(CardinalOrientation),
     sizeof(CardinalOrientation), 0},
    {2995854890, 0, 0, 1, 0, __alignof__(bool), sizeof(bool), 0}};

(The __alignof__ is compiler-specific; in this case the target is GCC, whereas on MSVC __alignof would be emitted at code generation time.)

Note the -1 values for the field with core-type-crc32 populated. These are placeholders because that core type's size is difficult to determine at compile time without knowing the size and alignment of all C types, which Cakelisp currently doesn't know. This is harder to fix than one might think, because Cakelisp allows direct inclusion of C headers. I would need to write a C parser that matched all C compilers to determine sizes at Cakelisp code-generation time, which is out of scope for the language. As a side note, this is a problem inherent to all languages designed to interoperate with C without any bindings, e.g. Zig (which takes the C parsing approach). To side-step this problem, I run a fix-up function at program startup that fills in those fields with the appropriate sizes given by the C compiler.
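Here's a sketch of what that fix-up amounts to for the doorDataVersionings example above. The C field names (alignment, elementSize) and the VoxelPosition type name are my guesses at what the generated C would look like, not the actual output:

// Illustrative start-up fix-up (GCC syntax, per the note above): patch the
// entries whose sizes and alignments could not be computed at code-generation
// time, i.e. the sub-struct fields marked with -1.
void fixUpDoorDataVersionings(void)
{
    // Index 2 is the position field, whose core type is voxel-position
    doorDataVersionings[2].alignment = __alignof__(VoxelPosition);
    doorDataVersionings[2].elementSize = sizeof(VoxelPosition);
}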

What's next

Padding and the use of POD arrays mean that there are a whole lot of zeros in the binary data on disk. It compresses extremely well, so I want to eventually start compressing it once the trade-off of write/read speed vs. compression/decompression speed favors compression.

I need to create a blend of binary versioning and my previous introspection system. The former is focused on getting data ready to go fast, whereas the latter is focused on allowing for various per-field processing on the data at runtime (e.g. automatically creating editors or watching for and describing field-by-field changes).

Writing custom migration handlers is a bit cumbersome. I should be able to come up with a variety of generic conversions (e.g. unsigned char to unsigned int).

I also want to support increasing/decreasing array sizes without creating new fields. This is clearly possible if you think of the versioning data as a log of changes to the schema. For example, an entry like (resize 2 256 1024) would say that after version 2 this field changed from size 256 to size 1024. The migration code would see that and simply copy the 256 elements from version 2 into the 1024-element array in version 3.

Conclusion

The complete system is available here, pinned to a version that will make sense relative to this article.

I'd love to hear your thoughts on this approach to data migration.