This is the first post in what will be a series of posts about implementing an interpreter for a small Lisp-like language, which I’m calling Sky. I’ve put up a page here with more information about my goals for this project. The code as of this post is tagged sky-0.0.1
I’m starting by defining a basic data model, i.e. how Sky values will be represented in C (the implementation language). If you’re already familiar with how dynamically-typed data is represented in C this may be a bit boring, but we have to start somewhere.
Being dynamically typed means that type information needs to be attached to each
value (as opposed to variable). I’ll use an enum for this.
enum type_tag {
TAG_NONE = 0,
TAG_INT,
TAG_CHAR,
TAG_STRING,
TAG_SYMBOL,
TAG_LIST
};
As you can see, the types I’m defining at this stage are integers, characters, strings, symbols, and lists. There are a few types I’ll almost certainly add later on (e.g. functions) and others that, while important and interesting in their own right, I may skip (e.g. floating point numbers).
Next I define the C structures that will represent Sky values.
struct string {
ptrdiff_t len;
unsigned char *data;
};
struct symbol {
struct object *name;
};
struct list {
struct object *first;
struct object *rest;
};
struct object {
enum type_tag tag;
union {
intptr_t i;
int c;
struct string string;
struct symbol symbol;
struct list list;
} u;
};
Using a union embedded in a struct alongside a type tag is probably the most
obvious way to represent dynamically-typed values in C, but it’s not the only
option and does have a couple notable downsides. Most importantly, a union is
always the size of its largest member. In this case, that means that on an
x86-64 the union uses 16 bytes (as required by struct string and struct
list), even when it only contains a 4-byte int (representing a character
code). However, that concern is purely hypothetical for now, and I’ll most
likely revise the data model a few times going forward anyway, so I’m not going
to worry about it.
Next up are constructors and accessors for the various types. I’ll show those for strings here since they have the most going on.
struct object *make_object(enum type_tag tag)
{
struct object *obj = xmalloc(sizeof(*obj));
obj->tag = tag;
return obj;
}
struct object *make_string(const char *data, ptrdiff_t len)
{
assert(len >= 0);
assert(len != PTRDIFF_MAX);
struct object *obj = make_object(TAG_STRING);
obj->u.string.len = len;
obj->u.string.data = xmalloc(len + 1);
memcpy(obj->u.string.data, data, len);
obj->u.string.data[len] = '\0';
return obj;
}
ptrdiff_t string_length(struct object *obj)
{
return obj->u.string.len;
}
const unsigned char *string_data(struct object *obj)
{
return obj->u.string.data;
}
int string_ref(struct object *obj, ptrdiff_t i)
{
const unsigned char *data = string_data(obj);
return data[i];
}
A few things to note:
- I’m using a function
xmalloc(defined in memory.c) to allocate memory. It callsabort(viaerror, defined in error.c) ifmallocis unable to allocate memory, so its callers don’t need to check for null pointers. - Unlike C strings, Sky strings include their length, and thus can contain
embedded
NULbytes. However,make_stringstill ensures that the string data isNUL-terminated, so that the string data is always a valid C string. Although, if the string does contain embeddedNULbytes, then Sky and (naive) C code will have different views of the string’s length and value. - Sky’s lists are traditional singly-linked lists. Unlike Common Lisp and
Scheme, Sky won’t allow “dotted pairs”, i.e. cons cells with arbitrary
elements in their
rest(a.k.a.cdr) field. In other words, the second argument toconsmust be a list ornil. The latter is represented as a null pointer. - Accessors like
string_length,string_data, andstring_refare often implemented as macros rather than functions. Sometimes this is for efficiency (particularly in unoptimized builds where the functions won’t be inlined), or because the the macro definitions would be more concise and could also be used for mutating their arguments. None of those reasons are particularly compelling for Sky, but the main reason I’ve implemented them as functions is so they’re available within a debugger session1.
Speaking of debuggers, it’s sometimes useful to use one as a sort of REPL.
Here’s a (lightly edited) brief example session in gdb where I set a
breakpoint on main, run the program, and then experiment with the functions
for integers and strings.
~/code/sky $ make clean all
~/code/sky $ gdb src/sky
(gdb) br main
Breakpoint 1 at 0x400c2a: file src/sky.c, line 42.
(gdb) r
Starting program: /home/jbm/code/sky/src/sky
Breakpoint 1, main () at src/sky.c:42
(gdb) set $i = make_integer(100)
(gdb) p $i
$1 = (struct object *) 0x603260
(gdb) p *$i
$2 = {tag = TAG_INT, u = {i = 100, ...}}
(gdb) p integer_data($i)
$3 = 100
(gdb) set $s = make_string("foo", 3)
(gdb) p $s
$4 = (struct object *) 0x603280
(gdb) p *$s
$5 = {tag = TAG_STRING, u = {..., string = {len = 3, data = 0x6032a0 "foo"}, ...}}
(gdb) p string_data($s)
$6 = (const unsigned char *) 0x6032a0 "foo"
(gdb) p string_length($s)
$7 = 3
(gdb) p string_ref($s, 0)
$8 = 102
(gdb) p/c string_ref($s, 0)
$9 = 102 'f'
See the code on GitHub for all the details - I haven’t discussed every line (or even every file) here, just picked out a few details for discussion. If you’ve come across this and have any questions or comments, please feel free to send me an email at the address below.
-
GDB can make macros available for use while debugging but, as far as I know, LLDB currently cannot. ↩︎