John Mastro: Sky 0.0.1: Basic data model

This is the first post in what will be a series of posts about implementing an interpreter for a small Lisp-like language, which I’m calling Sky. I’ve put up a page here with more information about my goals for this project. The code as of this post is tagged sky-0.0.1

I’m starting by defining a basic data model, i.e. how Sky values will be represented in C (the implementation language). If you’re already familiar with how dynamically-typed data is represented in C this may be a bit boring, but we have to start somewhere.

Being dynamically typed means that type information needs to be attached to each value (as opposed to variable). I’ll use an enum for this.

enum type_tag {
    TAG_NONE = 0,
    TAG_INT,
    TAG_CHAR,
    TAG_STRING,
    TAG_SYMBOL,
    TAG_LIST
};

As you can see, the types I’m defining at this stage are integers, characters, strings, symbols, and lists. There are a few types I’ll almost certainly add later on (e.g. functions) and others that, while important and interesting in their own right, I may skip (e.g. floating point numbers).

Next I define the C structures that will represent Sky values.

struct string {
    ptrdiff_t len;
    unsigned char *data;
};

struct symbol {
    struct object *name;
};

struct list {
    struct object *first;
    struct object *rest;
};

struct object {
    enum type_tag tag;

    union {
        intptr_t i;
        int c;
        struct string string;
        struct symbol symbol;
        struct list list;
    } u;
};

Using a union embedded in a struct alongside a type tag is probably the most obvious way to represent dynamically-typed values in C, but it’s not the only option and does have a couple notable downsides. Most importantly, a union is always the size of its largest member. In this case, that means that on an x86-64 the union uses 16 bytes (as required by struct string and struct list), even when it only contains a 4-byte int (representing a character code). However, that concern is purely hypothetical for now, and I’ll most likely revise the data model a few times going forward anyway, so I’m not going to worry about it.

Next up are constructors and accessors for the various types. I’ll show those for strings here since they have the most going on.

struct object *make_object(enum type_tag tag)
{
    struct object *obj = xmalloc(sizeof(*obj));
    obj->tag = tag;
    return obj;
}

struct object *make_string(const char *data, ptrdiff_t len)
{
    assert(len >= 0);
    assert(len != PTRDIFF_MAX);
    struct object *obj = make_object(TAG_STRING);
    obj->u.string.len = len;
    obj->u.string.data = xmalloc(len + 1);
    memcpy(obj->u.string.data, data, len);
    obj->u.string.data[len] = '\0';
    return obj;
}

ptrdiff_t string_length(struct object *obj)
{
    return obj->u.string.len;
}

const unsigned char *string_data(struct object *obj)
{
    return obj->u.string.data;
}

int string_ref(struct object *obj, ptrdiff_t i)
{
    const unsigned char *data = string_data(obj);
    return data[i];
}

A few things to note:

I’m using a function xmalloc (defined in memory.c) to allocate memory. It calls abort (via error, defined in error.c) if malloc is unable to allocate memory, so its callers don’t need to check for null pointers.
Unlike C strings, Sky strings include their length, and thus can contain embedded NUL bytes. However, make_string still ensures that the string data is NUL-terminated, so that the string data is always a valid C string. Although, if the string does contain embedded NUL bytes, then Sky and (naive) C code will have different views of the string’s length and value.
Sky’s lists are traditional singly-linked lists. Unlike Common Lisp and Scheme, Sky won’t allow “dotted pairs”, i.e. cons cells with arbitrary elements in their rest (a.k.a. cdr) field. In other words, the second argument to cons must be a list or nil. The latter is represented as a null pointer.
Accessors like string_length, string_data, and string_ref are often implemented as macros rather than functions. Sometimes this is for efficiency (particularly in unoptimized builds where the functions won’t be inlined), or because the the macro definitions would be more concise and could also be used for mutating their arguments. None of those reasons are particularly compelling for Sky, but the main reason I’ve implemented them as functions is so they’re available within a debugger session¹.

Speaking of debuggers, it’s sometimes useful to use one as a sort of REPL. Here’s a (lightly edited) brief example session in gdb where I set a breakpoint on main, run the program, and then experiment with the functions for integers and strings.

~/code/sky $ make clean all
~/code/sky $ gdb src/sky
(gdb) br main
Breakpoint 1 at 0x400c2a: file src/sky.c, line 42.
(gdb) r
Starting program: /home/jbm/code/sky/src/sky
Breakpoint 1, main () at src/sky.c:42
(gdb) set $i = make_integer(100)
(gdb) p $i
$1 = (struct object *) 0x603260
(gdb) p *$i
$2 = {tag = TAG_INT, u = {i = 100, ...}}
(gdb) p integer_data($i)
$3 = 100
(gdb) set $s = make_string("foo", 3)
(gdb) p $s
$4 = (struct object *) 0x603280
(gdb) p *$s
$5 = {tag = TAG_STRING, u = {..., string = {len = 3, data = 0x6032a0 "foo"}, ...}}
(gdb) p string_data($s)
$6 = (const unsigned char *) 0x6032a0 "foo"
(gdb) p string_length($s)
$7 = 3
(gdb) p string_ref($s, 0)
$8 = 102
(gdb) p/c string_ref($s, 0)
$9 = 102 'f'

See the code on GitHub for all the details - I haven’t discussed every line (or even every file) here, just picked out a few details for discussion. If you’ve come across this and have any questions or comments, please feel free to send me an email at the address below.

GDB can make macros available for use while debugging but, as far as I know, LLDB currently cannot. ↩︎