John Mastro: Sky 0.0.2: Immediate integers and characters

First of all, I apologize if you’re subscribed to the feed and are seeing duplicates of the last two posts. In both cases I changed things after publication that resulted in a different feed entry ID, causing many feed readers to treat the changed entry as a new entry. I think I have things sorted out well enough now that it won’t be an issue going forward.

Last time we wrote constructors and accessors for a handful of basic data types, all of them dynamically allocated in the heap (as opposed to on the stack). This time we’re going to change the representation of integers and characters so that they’re immediate values and don’t require dynamic allocation. You can see the code as of this post on the sky-0.0.2 tag, and the log of changes since the first time here.

The technique we’ll use to do this is widely used in programming language implementations, operating systems, and “low level” software more generally. It starts from the observation that the memory returned by functions like malloc must be suitably aligned for any (C) object.

As of C11, there’s a type max_align_t (defined in stddef.h) whose alignment is at least as large as that of every scalar type, plus an operator alignof (defined in stdalign.h) which will tell us the alignment requirement of a given type. We can use them to write a small test program that will tell us what alignment we can count on from malloc:

#include <stdio.h>
#include <stddef.h>
#include <stdalign.h>

int main(void)
{
    printf("alignof(max_align_t) = %zu\n", alignof(max_align_t));
    return 0;
}

In practice, the answer on a 64-bit machine will be 16 (bytes). The upshot is that memory addresses returned by functions like malloc will always be divisible by 16. Put another way, the 4 least significant bits of such memory addresses will always be zero (because 16 is 2^4). On a 32-bit machine, the minimum alignment will be 8 bytes, making the 3 least significant bits zero. You can also requested more highly-aligned memory using special allocation functions.

The bottom line is that at least the three least significant bits of our struct object * pointers will always be zero¹, which means we can use those bits for something else. If we use all three bits we can store 8 possible values there, but for now we’ll only use two bits, for four possible values. We’ll use them to store a type tag, with the following scheme:

Binary | Decimal | Type
-------|---------|-----------
    00 |       0 | Pointer
    01 |       1 | Integer
    10 |       2 | Character
    11 |       3 | Unused

If the two least significant bits are 00, the value represents a pointer (specifically, a struct object *). If they’re 01 the object is an integer, if they’re 10 it’s a character, and we’ll leave 11 unused for now.

Taking a step back for a moment, while we could tag the struct object * values directly, things will be cleaner and clearer if we instead introduce a C type that’s either a pointer or an immediate value. We introduce value_t to fill that roll:

typedef intptr_t value_t;

It’s a typedef of intptr_t, which is a signed integer large enough to hold a pointer. The downside is a bunch of extra casts, but they’ll be well-contained in the constructors and accesssors.

Okay, so how do we construct one of these tagged objects? We shift the value left by two bits and then add the tag.

#define TAG_BITS 2
#define PTR_BITS (sizeof(intptr_t) * CHAR_BIT)
#define VAL_BITS (PTR_BITS - TAG_BITS)
#define VAL_MASK (-(1 << TAG_BITS))

value_t make_integer(intptr_t value)
{
    return ((uintptr_t)value << TAG_BITS) + TAG_INT;
}

And then to get the original value back we simply shift right by two bits.

intptr_t integer_data(value_t value)
{
    return value >> TAG_BITS;
}

Of course we also need to teach get_type_tag about these immediate values.

enum type_tag get_type_tag(value_t value)
{
    if (value == NIL) return TAG_LIST;

    enum type_tag tag = value & ~VAL_MASK;

    if (!tag)
        tag = ((struct object *)value)->tag;

    return tag;
}

If your background is in higher-level dynamic programming languages like Python, you may not be very familiar with these bitwise operations. I wasn’t, anyway, until I started poking around in programming language implementations. My advice is to read up on them a bit and then spend some time playing around in GDB. Conveniently, you can use print /t (which can be abbreviated to p/t) to print the binary representation of a value, which makes it easy to see the effect of bitwise operations.

Here’s an example session experimenting with the bitwise operations we use to tag and untag an integer:

(gdb) set $i = 123
(gdb) p/t $i
$1 = 1111011
(gdb) p/t $i << 2
$2 = 111101100
(gdb) set $t = ($i << 2) + 1
(gdb) p/t $t
$3 = 111101101
(gdb) p/t $t >> 2
$4 = 1111011
(gdb) p $t >> 2
$5 = 123

Since we’re using two bits for the type tag, we only have either 62 or 30 bits left over for integer values. We can define preprocessor macros to identify the largest and smallest values we can represent.

#define MOST_POSITIVE_INT (INTPTR_MAX >> TAG_BITS)
#define MOST_NEGATIVE_INT (-1 - MOST_POSITIVE_INT)

We’re not going to do any error checking in make_integer yet, but eventually we’ll want to signal in error if an operation would result in an integer too large for us to represent. (In a production language, you would hopefully promote it to a “bignum” - an arbitrary precision integer - but we probably won’t go there).

For objects that will still be dynamically allocated on the heap, the only change is that we cast the struct object * to value_t when we return it, and then in the accessors cast back to struct object * so we can dereference it.

value_t make_string(const char *data, ptrdiff_t len)
{
    // Construct the `struct object *` the same as before
    return (value_t)obj;
}

const unsigned char *string_data(value_t value)
{
    return ((struct object *)value)->u.string.data;
}

To sum things up, we now have the following arrangement: At the C level, a Sky value is represented by a value_t. Each value_t is either an immediate value (an integer or character), or a struct object *. To access the immediate value, we untag it; to access a struct object, we perform the cast and dereference the pointer.

You can take this technique a lot further than we have here. We could get rid of the tag member of struct object entirely and tag all pointers rather than just immediate values. However, my hope is that this more minimal version gets us the low-hanging fruit and demonstrates the technique while keeping things as clear as possible.

I am ignoring sub-32 bit machines, which I think is reasonable in this context. ↩︎