John Mastro: Sky 0.0.6: Making all symbols readable

This is part of a series of posts about writing a simple interpreter for a small Lisp-like language. Please see here for an overview of the series.

Last time we added a not-quite-REPL and some round-trip read/print tests. This time we’re going to address an issue I mentioned at the end of the post about printing: currently, some symbols are printed in such a way that it’s impossible for them to be read back in.

We could accept that as a limitation, just like we’re accepting lots of other limitations. However, this is easy to fix, so we’re going to do it now.

You can see the code as of this post here, and a comparison against last time here.

The problem

To review the problem: every symbol has a name, which is a string. There are no restrictions on which strings can be the name of a symbol; even the empty string is a valid symbol name. A symbol’s name also determines how we print it, so we need to take special care with some names, because otherwise the printed result won’t read back in as the same symbol.

For example, consider a symbol with the name "foo bar". It will print as foo bar, but foo bar won’t be read back in as the symbol with the name "foo bar". Instead, it will be read back in as two symbols, foo and bar.
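
To make the failure concrete, here is a minimal sketch of how a whitespace-delimited reader sees that printed output. The count_tokens helper is hypothetical and not part of Sky; it only assumes that whitespace ends a bare symbol.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

// Hypothetical helper, not part of Sky: counts whitespace-separated tokens,
// which is roughly how a reader decides where one bare symbol ends.
static size_t count_tokens(const char *text)
{
    size_t count = 0;
    int in_token = 0;

    assert(text != NULL);

    for (; *text != '\0'; text++) {
        if (isspace((unsigned char)*text)) {
            in_token = 0;           // whitespace ends the current token
        } else if (!in_token) {
            in_token = 1;           // first character of a new token
            count++;
        }
    }
    return count;
}
```

Given the text foo bar, this reports two tokens, so the single symbol we printed comes back as two.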

The solution

If we don’t want to say “don’t do that”, the solution is to quote the symbol name somehow. We could pick special quote markers, but they would probably have to be two-character sequences, something like #|name here|#.

Instead, we’ll use regular old " as the quote character, and precede the value with a tag that tells the reader the following expression is a symbol. The result will be #symbol "name here".

The syntax is inspired by Clojure’s tagged literals, though we’re using it purely for symbols for now. We may make our implementation more general in the future.

The implementation

There are a few different bits to the implementation. First of all, we need a way to identify symbols with names that require quoting:

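static bool quotename(value_t name, ptrdiff_t len)
{
    // True while every character seen so far fits an optional sign
    // followed by digits.
    bool intlike = true;

    // The empty name has no unquoted spelling.
    if (len == 0) return true;

    for (ptrdiff_t i = 0; i < len; i++) {
        int c = string_ref(name, i);
        if (c < '!'     // Control and whitespace characters
            || c > '~'  // DEL and anything beyond printable ASCII
            || c == '(' || c == ')'
            || c == '"'
            || c == ';'
            || (i == 0 && c == '#'))
            return true;
        if (intlike
            && !((i == 0 && (c == '-' || c == '+'))
                 || (c >= '0' && c <= '9')))
            intlike = false;
    }

    // A name that looks like an integer would otherwise read back as a number.
    return intlike;
}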
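
If you want to experiment with the predicate outside the interpreter, here is a sketch of the same checks over a plain byte buffer. The name needs_quoting and the use of const char * in place of value_t are my own assumptions for illustration, as is treating each byte as unsigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Sketch only: the same checks as quotename, but over a plain byte buffer
// instead of Sky's value_t strings.
static bool needs_quoting(const char *name, size_t len)
{
    bool intlike = true;

    assert(name != NULL);
    if (len == 0) return true;  // the empty name has no bare spelling

    for (size_t i = 0; i < len; i++) {
        int c = (unsigned char)name[i];

        // Anything that would end or confuse a bare symbol token.
        if (c < '!' || c > '~'
            || c == '(' || c == ')' || c == '"' || c == ';'
            || (i == 0 && c == '#'))
            return true;

        // Track whether every character so far fits an optional sign
        // followed by digits.
        bool sign = (i == 0 && (c == '-' || c == '+'));
        bool digit = (c >= '0' && c <= '9');
        if (!sign && !digit)
            intlike = false;
    }
    return intlike;
}
```

One thing this makes easy to see: a lone + or - still counts as int-like, so with these checks it gets quoted too. That is conservative rather than wrong, since quoting a name that didn’t strictly need it still round-trips.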

We then use that predicate when printing a symbol. If it returns true, we print #symbol, and then print the symbol’s name as a string.

static void print_string_1(FILE *stream, value_t value, bool symbol)
{
    ptrdiff_t len = string_length(value);

    if (symbol && quotename(value, len)) {
        // Emit the tag, then fall through and print the name as a string.
        fputs("#symbol ", stream);
        symbol = false;
    }

    // ...
}
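
Here is a rough, self-contained sketch of the same printing decision, writing into a buffer instead of a FILE * so it’s easy to poke at. Everything in it is an assumption made for illustration: format_symbol is a hypothetical name, needs_quoting is a deliberately simplified stand-in for quotename, and the backslash escaping of " and \ is a guess at what the elided string-printing code does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Simplified stand-in for quotename (assumption): quote empty names and
// names containing a space, parenthesis, double quote, or semicolon.
static bool needs_quoting(const char *name)
{
    return name[0] == '\0' || strpbrk(name, " ()\";") != NULL;
}

// Writes the printed form of a symbol into out (capacity cap). Returns the
// number of characters written, or -1 if out is too small.
static int format_symbol(char *out, size_t cap, const char *name)
{
    size_t n;
    int r;

    assert(out != NULL && name != NULL);

    if (!needs_quoting(name)) {
        // Bare symbols print as their name, unchanged.
        r = snprintf(out, cap, "%s", name);
        return (r < 0 || (size_t)r >= cap) ? -1 : r;
    }

    // Tag first, then the name as a string literal.
    r = snprintf(out, cap, "#symbol \"");
    if (r < 0 || (size_t)r >= cap) return -1;
    n = (size_t)r;

    for (const char *p = name; *p != '\0'; p++) {
        // Assumed escaping: a backslash before '"' and '\\'.
        if (*p == '"' || *p == '\\') {
            if (n + 1 >= cap) return -1;
            out[n++] = '\\';
        }
        if (n + 1 >= cap) return -1;
        out[n++] = *p;
    }
    if (n + 1 >= cap) return -1;   // room for the closing quote and NUL
    out[n++] = '"';
    out[n] = '\0';
    return (int)n;
}
```

With this sketch, the name foo comes out as foo, while the name foo bar comes out as #symbol "foo bar".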

Finally, when we read #, we check whether it’s followed by symbol. If it is, we read the next expression, verify that it’s a string, and construct the symbol with that name.

if (TOKEQ(buf, len, "symbol")) {
    int flag;
    value_t sexp = read_sexp(stream, 0, &flag);
    // ...
    if (get_type_tag(sexp) == TAG_STRING)
        return make_symbol(sexp);
    // ...
}
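
For the reading side, here is a sketch of the same idea over a plain C string rather than a stream. The function read_tagged_symbol is hypothetical, it handles the tag and the string in one go instead of calling back into read_sexp, and its backslash handling is an assumption about the string syntax.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: parses text of the form  #symbol "name"  and copies
// the unescaped name into out. Returns true on success, false if the input
// is malformed or out is too small.
static bool read_tagged_symbol(const char *text, char *out, size_t cap)
{
    static const char tag[] = "#symbol";
    size_t n = 0;

    assert(text != NULL && out != NULL);
    if (cap == 0) return false;

    if (strncmp(text, tag, sizeof tag - 1) != 0) return false;
    text += sizeof tag - 1;

    // Skip the whitespace between the tag and the string literal.
    while (*text == ' ' || *text == '\t' || *text == '\n') text++;

    // The tagged expression must be a string.
    if (*text != '"') return false;
    text++;

    while (*text != '"') {
        if (*text == '\0') return false;        // unterminated string
        if (*text == '\\') {
            text++;                             // take the next char literally
            if (*text == '\0') return false;    // dangling escape
        }
        if (n + 1 >= cap) return false;         // leave room for the NUL
        out[n++] = *text++;
    }
    out[n] = '\0';
    return true;
}
```

Feeding it #symbol "foo bar" yields the name foo bar, and #symbol 42 is rejected because the tagged expression isn’t a string.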

There are a few new tests to go along with this, although I wouldn’t be surprised at all if there are edge cases that aren’t handled appropriately. However, they should be easy to fix as they come up.

Next time

Next time we’ll start working on a symbol table, which will ultimately let us intern symbols.