Sky 0.0.4: Reading data

This is part of a series of posts about writing a simple interpreter for a small Lisp-like language. Please see here for an overview of the series.

Last time we added the ability to print Sky data to a C stream. This time we’re adding the inverse, the ability to read Sky data from a C stream. The reader is the most complex component so far, with roughly three times as many lines of code as the printer.

You can see the code as of this post here, and a comparison against last time here.

For various reasons, I have less free time than I foresaw when I first started this series of posts. However, I don’t want to stop or pause the series, at least not until we’ve implemented a basic evaluator. Thus, in an attempt to keep things moving, I’m going to try to keep the posts relatively brief, perhaps skipping some things I would have liked to include.

A brief overview

At a high level read.c, contains a few distinct groups of functions:

  • read_next_char: takes a stream (FILE *) argument, skips over whitespace and comments, and returns the next character
  • read_token: takes a stream argument, fills a caller-provided fixed-size char * buffer, and returns an int of how many bytes it read
  • isdelimiter: takes a character argument and returns true if it’s a delimiter. This is how read_token knows it’s reached the end of a token
  • Functions whose names are prefixed with read_ (besides the two above): take a stream argument and return a Sky value (value_t)
  • Functions whose names are prefixed with parse_: take a string (const char *) argument and return a C value

To give a very brief example of how these pieces fit together, when reading #\newline (the character literal denoting a newline), the call flow would be:

  • read_sexp1 is the main internal entry-point; it’s called for every value read.
  • It calls read_next_char, which returns '#'. We then see that the next character is '\', which tells us that we’re reading a character literal2
  • Thus, we call read_character
  • It calls read_token, which will fill a char * buffer with the remainder of the token (in this case "newline") plus a terminating null byte (to guarantee that it’s a valid C string, even though we also pass the length around)
  • Next we call parse_character_name on the buffer, which figures out that newline means '\n' and returns it
  • Finally, we call make_character (defined earlier in data.c), which constructs the appropriate Sky value (a value_t) and return the result

A design limitation

One design limitation I wanted to maintain at this stage is that we only dynamically allocate memory (i.e. via malloc and friends) when it’s time to construct a valid Sky value. This means there are maximum lengths for literal symbols, characters, and strings. That would be annoying or even unacceptable in a production language, particularly for strings, but it helps keep things simple for us (and has no real cost in our particular context):

#define READ_CHAR_BUFSZ     16
#define READ_ATOM_BUFSZ    128
#define READ_STRING_BUFSZ 1024

This limitation also makes read_list somewhat interesting, which I’ll show here in full:

#define REST(x) (((struct object *)(x))->u.list.rest)

static value_t read_list(FILE *stream)
{
    value_t list = NIL;
    value_t last = NIL;

    for (;;) {
        int flag;
        value_t value = read_sexp(stream, ')', &flag);
        value_t cell;

        if (flag == ')')
            break;
        if (flag == EOF)
            error("EOF while reading");

        cell = cons(value, NIL);

        if (list == NIL)
            last = list = cell;
        else
            last = REST(last) = cell;
    }

    return list;
}

We keep pointers to both the head of the list (list) and the last cons cell in the list (last). Every time we read a new element, we create a new cell, and mutate the rest link of what had been the last element to point at the new cell.

As a result, list is always a valid Sky list. We’ve declared that Sky lists are immutable, and here we are mutating one, but that’s okay because the value is never exposed outside of read_list until it’s the full, complete list.

Why the limitation on dynamically allocating memory? Well, even though this is arguably looking too far ahead, it prevents read errors from either leaking memory or leaving invalid objects around that the garbage collector then has to deal with.

A temporary omission

If you’re familiar with Lisp and looked at read_atom, you would probably notice something missing: intern.

static value_t read_atom(FILE *stream)
{
    char buf[READ_ATOM_BUFSZ];
    int len = read_token(stream, buf, READ_ATOM_BUFSZ, false);
    // Other stuff
    return make_symbol(make_string(buf, len));
}

We’ll add symbol interning in a future installment.

Next time

Next time we’ll add some tests for reading and printing, and add a REPL minus the eval part.


  1. Symbolic expression ↩︎

  2. Why #\ rather than just, say, #? The idea is to use # as a namespace of sorts; so far we’re only using one entry. ↩︎