This is part of a series of posts about writing a simple interpreter for a small Lisp-like language. Please see here for an overview of the series.
Last time we added the ability to print Sky data to a C stream. This time we’re adding the inverse, the ability to read Sky data from a C stream. The reader is the most complex component so far, with roughly three times as many lines of code as the printer.
You can see the code as of this post here, and a comparison against last time here.
For various reasons, I have less free time than I foresaw when I first started this series of posts. However, I don’t want to stop or pause the series, at least not until we’ve implemented a basic evaluator. Thus, in an attempt to keep things moving, I’m going to try to keep the posts relatively brief, perhaps skipping some things I would have liked to include.
A brief overview
At a high level read.c
, contains a few distinct groups of functions:
read_next_char
: takes a stream (FILE *
) argument, skips over whitespace and comments, and returns the next characterread_token
: takes a stream argument, fills a caller-provided fixed-sizechar *
buffer, and returns anint
of how many bytes it readisdelimiter
: takes a character argument and returnstrue
if it’s a delimiter. This is howread_token
knows it’s reached the end of a token- Functions whose names are prefixed with
read_
(besides the two above): take a stream argument and return a Sky value (value_t
) - Functions whose names are prefixed with
parse_
: take a string (const char *
) argument and return a C value
To give a very brief example of how these pieces fit together, when reading
#\newline
(the character literal denoting a newline), the call flow would be:
read_sexp
1 is the main internal entry-point; it’s called for every value read.- It calls
read_next_char
, which returns'#'
. We then see that the next character is'\'
, which tells us that we’re reading a character literal2 - Thus, we call
read_character
- It calls
read_token
, which will fill achar *
buffer with the remainder of the token (in this case"newline"
) plus a terminating null byte (to guarantee that it’s a valid C string, even though we also pass the length around) - Next we call
parse_character_name
on the buffer, which figures out thatnewline
means'\n'
and returns it - Finally, we call
make_character
(defined earlier indata.c
), which constructs the appropriate Sky value (avalue_t
) and return the result
A design limitation
One design limitation I wanted to maintain at this stage is that we only
dynamically allocate memory (i.e. via malloc
and friends) when it’s time to
construct a valid Sky value. This means there are maximum lengths for literal
symbols, characters, and strings. That would be annoying or even unacceptable in
a production language, particularly for strings, but it helps keep things simple
for us (and has no real cost in our particular context):
#define READ_CHAR_BUFSZ 16
#define READ_ATOM_BUFSZ 128
#define READ_STRING_BUFSZ 1024
This limitation also makes read_list
somewhat interesting, which I’ll show
here in full:
#define REST(x) (((struct object *)(x))->u.list.rest)
static value_t read_list(FILE *stream)
{
value_t list = NIL;
value_t last = NIL;
for (;;) {
int flag;
value_t value = read_sexp(stream, ')', &flag);
value_t cell;
if (flag == ')')
break;
if (flag == EOF)
error("EOF while reading");
cell = cons(value, NIL);
if (list == NIL)
last = list = cell;
else
last = REST(last) = cell;
}
return list;
}
We keep pointers to both the head of the list (list
) and the last cons cell in
the list (last
). Every time we read a new element, we create a new cell, and
mutate the rest
link of what had been the last element to point at the new
cell.
As a result, list
is always a valid Sky list. We’ve declared that Sky lists
are immutable, and here we are mutating one, but that’s okay because the value
is never exposed outside of read_list
until it’s the full, complete list.
Why the limitation on dynamically allocating memory? Well, even though this is arguably looking too far ahead, it prevents read errors from either leaking memory or leaving invalid objects around that the garbage collector then has to deal with.
A temporary omission
If you’re familiar with Lisp and looked at read_atom
, you would probably
notice something missing: intern
.
static value_t read_atom(FILE *stream)
{
char buf[READ_ATOM_BUFSZ];
int len = read_token(stream, buf, READ_ATOM_BUFSZ, false);
// Other stuff
return make_symbol(make_string(buf, len));
}
We’ll add symbol interning in a future installment.
Next time
Next time we’ll add some tests for reading and printing, and add a REPL minus
the eval
part.