This is the first post in what will be a series of posts about implementing an interpreter for a small Lisp-like language, which I’m calling Sky. I’ve put up a page here with more information about my goals for this project. The code as of this post is tagged sky-0.0.1
I’m starting by defining a basic data model, i.e. how Sky values will be represented in C (the implementation language). If you’re already familiar with how dynamically-typed data is represented in C this may be a bit boring, but we have to start somewhere.
Being dynamically typed means that type information needs to be attached to each
value (as opposed to variable). I’ll use an enum
for this.
enum type_tag {
TAG_NONE = 0,
TAG_INT,
TAG_CHAR,
TAG_STRING,
TAG_SYMBOL,
TAG_LIST
};
As you can see, the types I’m defining at this stage are integers, characters, strings, symbols, and lists. There are a few types I’ll almost certainly add later on (e.g. functions) and others that, while important and interesting in their own right, I may skip (e.g. floating point numbers).
Next I define the C structures that will represent Sky values.
struct string {
ptrdiff_t len;
unsigned char *data;
};
struct symbol {
struct object *name;
};
struct list {
struct object *first;
struct object *rest;
};
struct object {
enum type_tag tag;
union {
intptr_t i;
int c;
struct string string;
struct symbol symbol;
struct list list;
} u;
};
Using a union
embedded in a struct
alongside a type tag is probably the most
obvious way to represent dynamically-typed values in C, but it’s not the only
option and does have a couple notable downsides. Most importantly, a union
is
always the size of its largest member. In this case, that means that on an
x86-64 the union
uses 16 bytes (as required by struct string
and struct
list
), even when it only contains a 4-byte int
(representing a character
code). However, that concern is purely hypothetical for now, and I’ll most
likely revise the data model a few times going forward anyway, so I’m not going
to worry about it.
Next up are constructors and accessors for the various types. I’ll show those for strings here since they have the most going on.
struct object *make_object(enum type_tag tag)
{
struct object *obj = xmalloc(sizeof(*obj));
obj->tag = tag;
return obj;
}
struct object *make_string(const char *data, ptrdiff_t len)
{
assert(len >= 0);
assert(len != PTRDIFF_MAX);
struct object *obj = make_object(TAG_STRING);
obj->u.string.len = len;
obj->u.string.data = xmalloc(len + 1);
memcpy(obj->u.string.data, data, len);
obj->u.string.data[len] = '\0';
return obj;
}
ptrdiff_t string_length(struct object *obj)
{
return obj->u.string.len;
}
const unsigned char *string_data(struct object *obj)
{
return obj->u.string.data;
}
int string_ref(struct object *obj, ptrdiff_t i)
{
const unsigned char *data = string_data(obj);
return data[i];
}
A few things to note:
- I’m using a function
xmalloc
(defined in memory.c) to allocate memory. It callsabort
(viaerror
, defined in error.c) ifmalloc
is unable to allocate memory, so its callers don’t need to check for null pointers. - Unlike C strings, Sky strings include their length, and thus can contain
embedded
NUL
bytes. However,make_string
still ensures that the string data isNUL
-terminated, so that the string data is always a valid C string. Although, if the string does contain embeddedNUL
bytes, then Sky and (naive) C code will have different views of the string’s length and value. - Sky’s lists are traditional singly-linked lists. Unlike Common Lisp and
Scheme, Sky won’t allow “dotted pairs”, i.e. cons cells with arbitrary
elements in their
rest
(a.k.a.cdr
) field. In other words, the second argument tocons
must be a list ornil
. The latter is represented as a null pointer. - Accessors like
string_length
,string_data
, andstring_ref
are often implemented as macros rather than functions. Sometimes this is for efficiency (particularly in unoptimized builds where the functions won’t be inlined), or because the the macro definitions would be more concise and could also be used for mutating their arguments. None of those reasons are particularly compelling for Sky, but the main reason I’ve implemented them as functions is so they’re available within a debugger session1.
Speaking of debuggers, it’s sometimes useful to use one as a sort of REPL.
Here’s a (lightly edited) brief example session in gdb
where I set a
breakpoint on main
, run the program, and then experiment with the functions
for integers and strings.
~/code/sky $ make clean all
~/code/sky $ gdb src/sky
(gdb) br main
Breakpoint 1 at 0x400c2a: file src/sky.c, line 42.
(gdb) r
Starting program: /home/jbm/code/sky/src/sky
Breakpoint 1, main () at src/sky.c:42
(gdb) set $i = make_integer(100)
(gdb) p $i
$1 = (struct object *) 0x603260
(gdb) p *$i
$2 = {tag = TAG_INT, u = {i = 100, ...}}
(gdb) p integer_data($i)
$3 = 100
(gdb) set $s = make_string("foo", 3)
(gdb) p $s
$4 = (struct object *) 0x603280
(gdb) p *$s
$5 = {tag = TAG_STRING, u = {..., string = {len = 3, data = 0x6032a0 "foo"}, ...}}
(gdb) p string_data($s)
$6 = (const unsigned char *) 0x6032a0 "foo"
(gdb) p string_length($s)
$7 = 3
(gdb) p string_ref($s, 0)
$8 = 102
(gdb) p/c string_ref($s, 0)
$9 = 102 'f'
See the code on GitHub for all the details - I haven’t discussed every line (or even every file) here, just picked out a few details for discussion. If you’ve come across this and have any questions or comments, please feel free to send me an email at the address below.
-
GDB can make macros available for use while debugging but, as far as I know, LLDB currently cannot. ↩︎