Dead Simple Testing for Programming Language Projects
Introduction
Say you are working on a new programming language project. The language is just getting out of the proof-of-concept phase, and you want to set up some form of automated testing for the compiler and/or interpreter. What would a testing tool for a programming language even look like? How do you test individual language features that may be mutually dependent on other language features? What can you do to make writing new tests easy? What can you do to make updating existing tests as frictionless as possible? How do you write a test for a syntax error? How do you write a test for something in the language's standard library?
In this article we will look at a simple strategy for testing programming languages and their associated compilers/interpreters. We will examine the test tooling that I have successfully used in two programming language projects, Sunder and Lumpy, in order to learn how that tooling works and why it is so effective. By the end of this post, you should have answers to all of the questions above, as well as a firm understanding of how to build something like this for your own programming language projects.
A Tour of the Test Tool
With the introductions out of the way, let's dive right into how Sunder and Lumpy are tested. Both repositories contain a variation on the same basic shell script. This script, named sunder-test in the Sunder repository and lumpy-test in the Lumpy repository, is what gets run when each project's make check target is executed. The {language}-test scripts also serve as the standard test utilities for each language, so when writing a third-party library for either language, one would also invoke {language}-test for that library's tests. Given that these scripts are almost identical, we will primarily focus on the lumpy-test script for the remainder of this article.
Alright, so what does the lumpy-test script actually do, and what do the tests executed by this script look like? In short, the lumpy-test script performs a variation of golden testing on a provided test file. A test file (identified by the file extension .test.lumpy) contains a Lumpy program alongside the expected output of that program, embedded in comments at the end of the file.
let x = 123 + 456;
println("123 + 456 is " + string(x));
################################################################################
# 123 + 456 is 579
In the above test, the lines:
let x = 123 + 456;
println("123 + 456 is " + string(x));
contain the actual program being tested. In this case the test adds two numbers and prints the result of that addition to the standard output stream. Then, a long line of comment characters separates the program section of the test from the expected output section of the test. Finally, the expected output section of the test:
# 123 + 456 is 579
indicates that the expected output of this test should be the text "123 + 456 is 579". When we run this test with the Lumpy interpreter, we indeed see that this text is printed:
$ lumpy add.test.lumpy
123 + 456 is 579
And when we run this test with lumpy-test, we see that the test tool agrees that the expected output of the test matches the actual output of the test:
$ lumpy-test add.test.lumpy
[= TEST add.test.lumpy =]
[= PASS =]
TESTS RUN => 1
FAILURES => 0
If the expected output of a test does not match the actual output of that test, then the test tool will show a diff of the expected and actual output and indicate that the test has failed.
$ cat mul.test.lumpy
let x = 123 - 456; # purposefully wrong
println("123 * 456 is " + string(x));
################################################################################
# 123 * 456 is 56088
$ lumpy-test mul.test.lumpy
[= TEST mul.test.lumpy =]
1c1
< 123 * 456 is 56088
---
> 123 * 456 is -333
[= FAIL =]
TESTS RUN => 1
FAILURES => 1
Multiple tests can be passed as command line arguments to lumpy-test:
$ lumpy-test add.test.lumpy mul.test.lumpy
[= TEST add.test.lumpy =]
[= PASS =]
[= TEST mul.test.lumpy =]
1c1
< 123 * 456 is 56088
---
> 123 * 456 is -333
[= FAIL =]
TESTS RUN => 2
FAILURES => 1
And if no command line arguments are passed to lumpy-test, then the script will use find to locate all files ending with .test.lumpy and run those tests automatically:
$ ls
add.test.lumpy mul.test.lumpy
$ lumpy-test
[= TEST ./add.test.lumpy =]
[= PASS =]
[= TEST ./mul.test.lumpy =]
1c1
< 123 * 456 is 56088
---
> 123 * 456 is -333
[= FAIL =]
TESTS RUN => 2
FAILURES => 1
Some really important things that I want to emphasize are (1) test files are just regular programs and (2) the expected output of the test is right next to the code that is supposed to generate that expected output. Keeping test files as regular programs means that debugging a test is as easy as executing the test over and over in an edit-and-run cycle until the output printed to your terminal and the expected test output match. When I see that my newly written mul.test.lumpy test is failing, I can look at just the actual output of the test by running the interpreter on that test file:
$ lumpy-test
[= TEST ./add.test.lumpy =]
[= PASS =]
[= TEST ./mul.test.lumpy =]
1c1
< 123 * 456 is 56088
---
> 123 * 456 is -333
[= FAIL =]
TESTS RUN => 2
FAILURES => 1
$ # oops! something is wrong!
$ lumpy mul.test.lumpy
123 * 456 is -333
Walking through and rubber-duck-debugging the test output as part of this edit-and-run cycle is a lot easier when the expected output is in the same file as the test. No switching between tabs in my editor, no separate .expected file that I need to keep in sync with the actual test program, and no additional test framework boilerplate. While working on Sunder and Lumpy, my workflow when updating the compiler/interpreter/standard-library tests tends to be:
1. Write my changeset
2. Run the full test suite with {language}-test
3. If a test is failing, run that test individually with {language}-test {that-test}
4. Make additional edits attempting to address the problem
5. Run the program standalone and check the behavior changes
6. Goto 4 until the test is in the shape I would expect
7. Verify the individual test passes when run with {language}-test {that-test}
8. Goto 2 until all tests are passing
9. Commit my changeset
This workflow is super easy with the way the test tooling is set up, and the cognitive load stays low: there is very little I have to keep track of outside of the single test file I am working with.
Digging Into the Implementation Details
Okay so we have seen what it is like to work with the test tooling. Let us now take a look at the actual shell script itself and find out why I consider this testing method to be so "dead simple". At the time of writing, the entirety of lumpy-test is as follows:
#!/bin/sh
set -e
export LC_ALL=C

PROGNAME=$(basename "$0")
usage() {
    cat <<EOF
Usage: ${PROGNAME} [FILE...]

Options:
  -h    Display usage information and exit.
EOF
}

for arg in "$@"; do
    case "${arg}" in
    -h)
        usage
        exit 0
        ;;
    *)
        ;;
    esac
done

if [ -z "${LUMPY_HOME}" ]; then
    LUMPY_HOME=$(pwd)
fi

TESTSRUN=0
FAILURES=0
test() {
    TEST="$1"
    echo "[= TEST ${TEST} =]"

    set +e
    RECEIVED=$(\
        cd "$(dirname "${TEST}")" 2>&1 && \
        python3 "${LUMPY_HOME}/lumpy.py" "$(basename "${TEST}")" 2>&1)
    set -e

    EXPECTED=$(\
        sed -n '/^########\(#\)*/,$p' "${TEST}" |\
        sed '1d' |\
        sed -E 's/^#[[:space:]]?//g')

    if [ "${EXPECTED}" = "${RECEIVED}" ]; then
        echo '[= PASS =]'
    else
        TMPDIR=$(mktemp -d)
        trap '{ rm -rf -- "${TMPDIR}"; }' EXIT
        printf '%s\n' "${EXPECTED}" >"${TMPDIR}/expected"
        printf '%s\n' "${RECEIVED}" >"${TMPDIR}/received"
        diff "${TMPDIR}/expected" "${TMPDIR}/received" || true
        echo '[= FAIL =]'
        FAILURES=$((FAILURES + 1))
    fi

    TESTSRUN=$((TESTSRUN + 1))
}

TESTS= # empty
if [ "$#" -ne 0 ]; then
    for arg in "$@"; do
        if [ -d "$(realpath ${arg})" ]; then
            FILES=$(find "$(realpath ${arg})" -name '*.test.lumpy' | sort)
            TESTS=$(echo "${TESTS}" "${FILES}")
        else
            TESTS=$(echo "${TESTS}" "${arg}")
        fi
    done
else
    TESTS=$(find . -name '*.test.lumpy' | sort)
fi

for t in ${TESTS}; do
    test "${t}"
done

echo "TESTS RUN => ${TESTSRUN}"
echo "FAILURES => ${FAILURES}"
[ "${FAILURES}" -eq 0 ] || exit 1
That is 84 lines of fairly mundane shell. There is not anything particularly novel happening here, and I am fairly confident that an equivalent or better version of this script could be built in under a day. I wrote this test tooling in POSIX shell since find, sed, and diff are all easily accessible on modern Unix systems, but it is not difficult to imagine a version of this tooling written in another language such as Python or Rust.
The meat of this script is found in the test function. This function takes a relative filepath as an argument, runs the program contained in that file, and compares the expected output of that program to the actual output generated when the program was run, signaling a single test failure if the expected and actual output do not match.
The actual program output is captured through the assignment to RECEIVED:
RECEIVED=$(\
    cd "$(dirname "${TEST}")" 2>&1 && \
    python3 "${LUMPY_HOME}/lumpy.py" "$(basename "${TEST}")" 2>&1)
Notice that we cd into the directory containing the test before running that test. Some tests will have additional data that needs to be accessed relative to the location of the test file, and we want to make sure that relative pathing will behave the same regardless of the directory from which the testing tool is invoked. An example of this in practice would be something like an Advent of Code problem, where a test program such as day-xx.test.lumpy might want to read the text file day-xx.input located in the same directory as the test. It is a lot more convenient to write:
let text = fs::read("day-xx.input");
than it is to stitch together a path relative to the test file with:
let text = fs::read(module::directory + "/day-xx.input");
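To make that concrete, consider a hypothetical layout (the day-01 file names here are invented for illustration) where the input file sits right next to the test. Because the tool cd's into the test's directory before running it, the test behaves the same whether lumpy-test is invoked from inside that directory or from the repository root:
$ ls tests/
day-01.input  day-01.test.lumpy
$ lumpy-test tests/day-01.test.lumpy
[= TEST tests/day-01.test.lumpy =]
[= PASS =]
TESTS RUN => 1
FAILURES => 0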
Once we have the actual program output, we run this funky sed chain to extract out the expected program output:
EXPECTED=$(\
    sed -n '/^########\(#\)*/,$p' "${TEST}" |\
    sed '1d' |\
    sed -E 's/^#[[:space:]]?//g')
Here we look for a line beginning with eight or more # characters. Both Sunder and Lumpy have single-line comments that begin with #, and a different language would require substituting # with the appropriate single-line comment character(s) (a sketch of such a substitution for a language with // comments follows the example below). We then remove that line of # characters, and strip the leading # characters from the remaining text of the expected output section. We also remove an optional space character just after each # within the expected output section so that a line such as # 123 + 456 is 579 will be treated the same as the line #123 + 456 is 579. In my opinion, that extra space makes it significantly easier to read the expected output section when you have been debugging and testing all day. Just to show this sed chain in action, let us take a look at what happens when we run this command on our example test from earlier:
$ cat add.test.lumpy
let x = 123 + 456;
println("123 + 456 is " + string(x));
################################################################################
# 123 + 456 is 579
$ sed -n '/^########\(#\)*/,$p' add.test.lumpy |\
sed '1d' |\
sed -E 's/^#[[:space:]]?//g'
123 + 456 is 579
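As a concrete sketch of that substitution (hypothetical, and not taken from either project), a language whose single-line comments begin with // might extract the expected output like so:
# hypothetical variant of the EXPECTED extraction for a //-comment language
EXPECTED=$(\
    sed -n '/^\/\/\/\/\/\/\/\/\(\/\)*/,$p' "${TEST}" |\
    sed '1d' |\
    sed -E 's/^\/\/[[:space:]]?//g')
The forward slashes are escaped only because / doubles as the sed delimiter; choosing a different delimiter (for example s,^//[[:space:]]?,,g) avoids the escaping entirely.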
If our actual output does not match the expected output, then we diff the actual and expected output and display that result to the user:
TMPDIR=$(mktemp -d)
trap '{ rm -rf -- "${TMPDIR}"; }' EXIT
printf '%s\n' "${EXPECTED}" >"${TMPDIR}/expected"
printf '%s\n' "${RECEIVED}" >"${TMPDIR}/received"
diff "${TMPDIR}/expected" "${TMPDIR}/received" || true
And that is pretty much everything. We have the printing of the TEST and the PASS or FAIL lines, which are primarily intended to make reading failing test diffs easier. We perform some bookkeeping, incrementing the total number of tests run every time the test function is executed, and incrementing the number of failing tests right after we emit a failing test diff. But that is about it. The script is pretty simple, and it would not be too difficult to rewrite, extend, or customize this script for another programming language project.
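One last detail worth noting before moving on: the argument-collection loop near the bottom of the script also accepts directories, expanding each one (via realpath and find) into the sorted list of .test.lumpy files beneath it, which is convenient when a library keeps all of its tests under a single directory. A hypothetical invocation (the library layout and absolute paths are invented for illustration) might look like:
$ lumpy-test tests
[= TEST /home/user/my-library/tests/sort.test.lumpy =]
[= PASS =]
[= TEST /home/user/my-library/tests/stack.test.lumpy =]
[= PASS =]
TESTS RUN => 2
FAILURES => 0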
Some More Neat Stuff
This type of golden testing strategy comes with a bunch of extra benefits that are particularly nice when working on compilers and interpreters.
Running each test independently adds process-level isolation between tests! Program state from the invocation of the compiler/interpreter used for test A will not affect the program state during the invocation of the compiler/interpreter used for test B. Artifacts produced by already-run tests (e.g. files left on the filesystem) will still be visible to future tests, but (in my experience) there are very few instances where that ever actually matters.
Compilers and interpreters don't care about individual language features! In the introduction section, I asked how one would test language features that are mutually dependent on each other. Well, from the perspective of the compiler or interpreter there is only one language, and if two language features are dependent on each other, then you do not need to figure out how to test them separately. Just throw them together in one or more tests and make sure you get good coverage over the surface area of both features. There is no dependency injection or friend shenanigans required to make testing easier, as the entry point to the compiler/interpreter packages the entire language together for every test invocation.
Warnings and errors are captured as test output! It is trivial to write tests for lexical, syntactical, and semantic errors when diagnostic messages are picked up as "actual program output" by the testing tool. Take a look at this example from the Sunder project repository verifying that the correct error message is emitted when an invalid number of arguments are passed to a function:
$ cat tests/error-call-invalid-argument-count.test.sunder
func foo(a_: usize, b_: ssize, c_: bool) ssize {
return 123s;
}

func main() void {
foo(1u, 2s);
}
################################################################################
# [error-call-invalid-argument-count.test.sunder:6] error: function with type `func(usize, ssize, bool) ssize` expects 3 argument(s) (2 provided)
# foo(1u, 2s);
# ^
$ sunder-run tests/error-call-invalid-argument-count.test.sunder
[tests/error-call-invalid-argument-count.test.sunder:6] error: function with type `func(usize, ssize, bool) ssize` expects 3 argument(s) (2 provided)
foo(1u, 2s);
^
$ sunder-test tests/error-call-invalid-argument-count.test.sunder
[= TEST tests/error-call-invalid-argument-count.test.sunder =]
[= PASS =]
TESTS RUN => 1
FAILURES => 0
The core language and standard library use the same testing tool! Once again, the testing tool only cares about actual vs expected program output. From the perspective of the tool, the core language, builtins, standard library, and even external libraries are all the same thing. Run program, get output. This is refreshingly straightforward, and in practice I have found that writing and updating tests requires very little mental context switching. It is difficult to express how nice it is to know that opening any test file will present you with a normal program.
Some Less Neat Stuff
Okay so I am a huge fan of this approach to golden testing, and I have no intention of deviating from this strategy in future programming language projects. That being said, this methodology is not without its issues.
Golden testing in general requires deterministic output, and if a program's output is non-deterministic, then this kind of testing is useless! For example, stringified references in Lumpy contain the hex-encoded address of the referenced object in the stringified text. The same program will produce a different stringified address from run to run, so it would be difficult to write a test validating the non-deterministic behavior of that stringification:
$ cat reference.lumpy
println(123.&);
$ lumpy reference.lumpy
reference@0x100ea7200
$ lumpy reference.lumpy
reference@0x100f97200
Unrelated changes can break existing tests! Say I wanted to add a new warning to the Sunder compiler that will inform a user when they have a function call with an unused return value. The patch for it is pretty simple, but there are currently hundreds of tests (and a few places in the standard library) containing call-expressions with (purposefully) unused return values. This small change would require me to update all of those tests even though they do not have a whole lot to do with the actual changeset. This is a form of spooky action at a distance, and is less-than-ideal in my opinion.
~/sources/sunder[main !]$ git rev-parse HEAD
0a3595e264b0830a43285ea1c2cfd4f267673f82
~/sources/sunder[main !]$ git diff
diff --git a/resolve.c b/resolve.c
index ff8e091..9c0e5f1 100644
--- a/resolve.c
+++ b/resolve.c
@@ -3964,6 +3964,13 @@ resolve_stmt_expr(struct resolver* resolver, struct cst_stmt const* stmt)
     }
 
     struct stmt* const resolved = stmt_new_expr(stmt->location, expr);
+    if (expr->kind == EXPR_CALL && expr->type->kind != TYPE_VOID) {
+        warning(
+            expr->location,
+            "unused return value of type `%s`",
+            expr->type->name);
+    }
+
     freeze(resolved);
     return resolved;
 }
~/sources/sunder[main !]$ make check 2>&1 | tail -n3
TESTS RUN => 550
FAILURES => 222
make: *** [check] Error 1
Conclusion
Despite some of the limitations, I have found this flavor of golden testing to be the most effective way to test my programming languages. The tooling shown in this article is easy to build from scratch, is pleasant to work with, and ends up scaling reasonably well. I have spent a good chunk of the last four years working on programming languages that utilize this tooling, and I have nothing but good things to say about the testing experience. I encourage prospective language authors to experiment with this technique and remix it to suit the needs of their own projects. Cheers!