Edd Mann Developer

Rewriting the santa-lang interpreter in Rust, Part 1 - Implementing the Core

After implementing santa-lang in TypeScript (Node) I wanted to explore rewriting the tree-walking interpreter in a lower-level systems language for efficiency and performance gains. My goal was to be able to run the entire Advent of Code 2022 calendar *quicker* than the Node variant. I settled on using Rust due to its blend of high and low level constructs, its vibrant package registry (Cargo), memory management model and previous enjoyable experience using the language. In this first article within the series, I will document how I went about organising the project and rewriting the core language within Rust.

santa-lang

I decided to organise the project using Cargo workspaces, with the language and each (delivery) runtime being colocated in the same monorepo as separate packages. This provided a good level of modularity and structure to the project. Within the language library itself, I broke up each of the individual layers into subdirectory modules, with tests colocated with the specific layer.

├── Cargo.toml
├── runtime
│   └── cli
│       └── Cargo.toml
└── lang
    ├── Cargo.toml
    └── src
        ├── evaluator
        ├── lexer
        ├── parser
        └── runner

To implement the core I used the TypeScript test-suite as documented behaviour I wished to provide. This proved to be a very rewarding experience, where-by investing the time in making a solid test-suite for the initial implementation made it somewhat trivial to follow this time around. Due to the manner in which each layer (Lexer, Parser, Evaluator) is built on-top of each other I was able to work my way up the stack, garnering confidence at each level. Whilst building out the testing framework within Rust I stumbled upon expect_test, which uses macros to provide a frictionless means to validate (and automatically update) test assertions. For data-structure centric outputs such as the Lexer and Parser this was ideal.

Lexer

Within the Lexer I decided to implement the tokenizer as an Iterator which the Parser could consume. I additionally used a macro (shown below) to express the Lexer tokens in an easier to read form than the Token enumeration itself.

#[macro_export]
macro_rules! T {
    [INT] => { $crate::lexer::TokenKind::Integer };
    [+]   => { $crate::lexer::TokenKind::Plus };
    [||]  => { $crate::lexer::TokenKind::PipePipe };
    // ..
}

Parser

With the Lexer implemented and tested, the generated tokens were then used to construct an Abstract Syntax Tree (AST) by-way of the Parser. There were several interesting choices that were made due to using Rust as the implementation language.

In this case an error was handled using an explicit Result type, as opposed to throwing an exception. Upon a parser error being found, a ParserErr would be returned with a detailed message and the source location. To not introduce indentation hell I appreciated how we could express an Err short-circuiting execution using the question mark operator. For example, self.expect(T![ID])? ensures that the next token consumed is an identity else it short-circuits and returns the Err type. This syntax also works for Option types too.

As discussed before, we were able to use the macro tokens that were defined within the Lexer to clearly describe the translation to a meaningful AST. One such example of this was whilst parsing an if expression, combining the representation of the tokens as expected santa-lang syntax along with short-circuiting upon a ParserErr.

fn parse_if_expression(&mut self) -> RExpression {
    let start = self.expect(T![IF])?;

    self.consume_if(T!['(']);
    let condition = Box::new(self.parse_expression(Precedence::Lowest)?);
    self.consume_if(T![')']);
    let consequence = Box::new(self.parse_block_statement()?);
    let alternative = if self.consume_if(T![ELSE]) {
        Some(Box::new(self.parse_block_statement()?))
    } else {
        None
    };

    Ok(Expression {
        kind: ExpressionKind::If { condition, consequence, alternative },
        source: start.source_range(&self.current_token),
    })
}

It was also during Parser development that I became very familiar with the concept of a Box. This type stores a value on the heap rather than the stack - with the size of the smart-pointer known at compile-time, and the value it points to being of indeterminate size until runtime. As we are allocating the value onto the heap this does incur performance penalties. It also provides us with the ability to define recursive types, which is required when building up the AST. For example, in the case of Box<Expression>, we do not know the exact type/size of Expression at compile-time as we are required to parse the santa-lang source code provided at runtime to decern this.

Evaluator

The largest undertaking was evaluating the parsed AST. In this step I opted to use the concept of Frames to model the stacked computation. Represented as an enumeration, it allowed me to trivially express and destructure the different states.

Like the TypeScript implementation, trying to break up the evaluation behaviour into manageable sized chunks was an important task. To achieve this I decided to break up key behaviours into separate functions (located in separate files). To ensure that there was no performance penalty or unnecessary function invocation for this decision I used inline annotation’s, which will be spoken about more in a future article within the series. Some behaviours that I separated out were language function invocation (user-land, memoized, closures, external and built-in), infix operations and santa-lang’s match expressions.

Rc<RefCell<T>>

Similar to the Parser, as the size of certain values are not known at compile-time, some work has to be delegated to runtime. As evaluation of the santa-lang source code could lead to shared ownership and mutation of given values we could not rely on Rust’s compile-time memory management. It was here that I was introduced to Rc<RefCell<T>>. The Rc keeps tracks of how many references exist to a given value, deallocating it when the count reaches zero. The RefCell provides us with the ability to borrow a mutable value even when the type is deemed immutable. Combined, this allows multiple parts of the program to have read-only access to the same value, whilst allowing parts to mutate it when necessary. This results in runtime errors over our preferred compile-time errors. It should also be noted that this combination only works for single-threaded applications, there are alternative means of being able to solve this for multithreaded applications (i.e. using Arc).

I have a love-hate relationship with this concept, on the one hand it allows me to trivially handle many Object’s being passed around and mutated throughout the evaluators life-time. However, on the other hand it doesn’t feel very Rust like, losing the compile-time memory management guarantees that we hold so dear. Having looked at several different interpreter implementations in Rust I noticed that all had used this concept, but I would really like to go back and see if there is another way to remove the high dependence on this going forward.

Lazy Sequences

In the previous implementation I had made extensive use of the ImmutableJS library, not only for immutable data-structures but the lazy operations that could be applied on them as well. However, when it came to the Rust implementation there was no such library that offered both behaviours. I was fortunately able to use the im-rs library which provided performant Immutable data-structures (of which I found one gotcha bug), but in-regards to the lazy sequences I had to implement these myself.

It was an interesting challenge (especially with Rust’s borrow checker) to handle such computation. Similar to the Lexer I built my own custom Iterator which handled resolving the underlying lazy value and applying operations which had been defined (maps, filters, etc.). I was happy with the resulting behaviour but still feel that this implementation could do with some attention, it still does not feel very Rust-like to me (I wonder what a true Rustacean would think of it).

Standard Library

A large part of the santa-lang evaluator is the standard library that it provides. As the language is dynamically-typed, each function call can be passed many arguments (of differing types) and needs to be able to handle them accordingly. Upon researching a means to handle this within Rust I could see there were two options, dynamic dispatch (via Object type traits) and in-place invocation (via pattern matching). I had used dynamic dispatch within the TypeScript implementation and although highly readable/extendable, it was going to result in runtime performance penalties I was not willing to incur. As such, I was only left with in-place invocation.

Taking inspiration from the enum_dispatch crate, I decided to build my own builtin! function suite of macros, allowing me to remove the boilerplate of choosing inline invocation whilst keeping a clean DSL for defining the many different functions the language had to offer. A good example of this is how the map function is defined, where-by we are able to co-locate the different mapping behaviours based on type.

builtin! {
    map(mapper, collection) [evaluator, source] match {
        (Object::Function(mapper), Object::List(list)) => { // .. }
        (Object::Function(mapper), Object::Set(set)) => { // .. }
        (Object::Function(mapper), Object::Dictionary(map)) => { // .. }
        (Object::Function(mapper), Object::LazySequence(sequence)) => { // .. }
        (Object::Function(mapper), Object::String(string)) => { // .. }
    }
}

Additionally, it catered for multi-arity arguments.

builtin! {
    zip(collection, ..collections) [evaluator, source] match {
        (_, Object::List(collections)) => { // .. }
    }
}

This DSL is one of the areas I am most proud of within the Rust rewrite of the language. It allows me to clearly define a function’s parameters and implement pattern matched behaviours based on the parsed in arguments.

External Functions

Along with providing in-built functions defined within the core language, the TypeScript implementation included a means of adding external functions which are defined during evaluator instantiation. These external functions can be used to provide runtime specific behaviour, for example, implementing I/O concerns such as puts and read. To implement this capability within the interpreter I was required to use a dyn Fn trait object definition over basic function pointers. This provides the ability to pass Closures as external functions - which was a requirement when building the WASM runtime, discussed in the next article. There are some performance disadvantages with having to model it this way, due to extra indirection via the vtable lookup. However, it should be noted that a heap allocation is not required if the trait object it is storing has zero size, which is the case for functions and Closures which do not capture any variables.

Runner

Finally, I was able to build the Advent of Code Runner which was used to parse the AoC file format and execute the solutions (with provided tests). At this step I enjoyed building a Time trait which was used for runtimes to specify how they determined the current time, as we could not solely on a POSIX implementation (i.e. WASM). I also encapsulated the internal Parser and Runtime errors into a type which is publicly accessible to the runtimes using the From trait (From<RuntimeErr> for RunErr). This ensured that the internal error types remained private and were not leaked out of the core domain.

What’s next…

With the core language library now implemented and tested it was time to move on to the different runtimes. In the next post within the series I will document how I went about integrating the core language library into the (delivery) runtimes.