The Candy Compiler Pipeline

The journey of a file from source code to byte code

by Marcel Garus · 2023-9-29
available at www.marcelgarus.dev/candy-compiler-pipeline

See the previous article for an introduction to Candy, a new programming language.

An interesting property of Candy is that every time your program panics, a single call site can be made responsible. This is a necessary to enable good tooling using fuzzing – for found crashes, the faulty call site can be highlighted in the IDE.

From a compiler-building perspective, this also makes the Candy compiler quite unique. In this article, I'll show a general outline of the different compilation stages.

compiler pipeline mountain

Source Code

We start with source code. Source code is stored as UTF-8 encoded bytes.

As an example, we'll look at the following file. It exports a function inc, which increases the given integer by one:

[int] = use "Core"
[add] = int

inc a :=
  needs (int.is a)
  add a 1

Concrete Syntax Tree (CST)

The first step is a parser, which converts the string of characters into a tree. This so-called Concrete Syntax Tree (or CST) contains all information about the original source code – every comment, every whitespace, even sections of the code that could not be parsed.

The CST can be converted back to original source code and is heavily used for tools such as syntax highlighting or code formatting.

This is the CST for the code above:

... // the `use`
Newline("\n")
... // the destructuring into `add`
Newline("\n")
Newline("\n")
Assignment {
  leftTrailingWhitespace {
    childCall {
      receiverTrailingWhitespace childIdentifier("inc"), whitespace: [Whitespace(" ")] },
      arguments: [Identifier("a")]
    },
    whitespace: [Whitespace(" ")]
  },
  assignment_signTrailingWhitespace {
    childColonEqualsSign,
    whitespace: [Newline("\n"), Whitespace("  ")]
  },
  body: [
    Call {
      receiverTrailingWhitespace childIdentifier("needs"), whitespace: [Whitespace(" ")] },
      arguments: [Parenthesized {
        opening_parenthesisOpeningParenthesis,
        innerCall {
          receiverTrailingWhitespace {
            childStructAccess struct_Identifier("int"), dotDotkeyIdentifier("is") },
            whitespace: [Whitespace(" ")]
          },
          arguments: [Identifier("a")]
        },
        closing_parenthesisClosingParenthesis
      }]
    },
    Newline("\n"),
    Whitespace("  "),
    Call {
      receiverTrailingWhitespace childIdentifier("add"), whitespace: [Whitespace(" ")] },
      arguments: [
        TrailingWhitespace childIdentifier("a"), whitespace: [Whitespace(" ")] },
        Int value1string"1" }
      ]
    }
  ]
}
Newline("\n")

The parser is perhaps the most complicated part of the compilation pipeline because it bridges a very large abstraction gap – all other stages just convert one tree into another, simpler tree.

The parser basically consists of a bunch of functions that look like this:

fn opening_parenthesis(input: &str) -> Option<(&strCst)> {
  let input input.strip_prefix("(")?;
  Some(inputCstKind::OpeningParenthesis.into())
}

They take an input and try to find a specific pattern at the beginning of the string. In this case, the function looks for an opening parenthesis (. If it doesn't find a parenthesis, it returns None. Otherwise, it returns a tuple of the unparsed, remaining part of the string as well as the parsed part.

opening_parenthesis("(abc") -> Some(("abc"OpeningParenthesis))
opening_parenthesis("abc") -> None

More complicated parsers are built from these simpler ones. These get pretty complicated pretty quickly. For example, you might think parsing a parenthesized expression such as (4) is easy – just parse an opening parenthesis, parse the expression, then parse a closing parenthesis.

But will it handle all these edge cases correctly?

# comment
  3
  # another comment
)

(4  # missing closing parenthesis

()  # missing expression

You might think these are extreme edge cases not worth caring about, but the parser is a fundamental part of the tooling that lots of other parts use – for example syntax highlighting. In pratice, the parser runs each time you enter a character into a Candy file, so it needs to be very robust and handle unfinished code gracefully.

Here's the actual parser for parenthesized expressions:

fn parenthesized(input: &strindentationusize) -> Option<(&strRcst)> {
  let (inputopening_parenthesis) = opening_parenthesis(input)?;

  let (inputwhitespace) = whitespaces_and_newlines(inputindentation 1true);
  let inner_indentation if whitespace.is_multiline() {
      indentation 1
  else {
      indentation
  };
  let opening_parenthesis opening_parenthesis.wrap_in_whitespace(whitespace);

  let (inputinner) = expression(inputinner_indentationfalsetruetrue).unwrap_or((
      input,
      CstKind::Error {
          unparsable_input"".to_string(),
          errorCstError::OpeningParenthesisMissesExpression,
      }
      .into(),
  ));

  let (inputwhitespace) = whitespaces_and_newlines(inputindentationtrue);
  let inner inner.wrap_in_whitespace(whitespace);

  let (inputclosing_parenthesis) = closing_parenthesis(input).unwrap_or((
      input,
      CstKind::Error {
          unparsable_input"".to_string(),
          errorCstError::ParenthesisNotClosed,
      }
      .into(),
  ));

  Some((
      input,
      CstKind::Parenthesized {
          opening_parenthesisBox::new(opening_parenthesis),
          innerBox::new(inner),
          closing_parenthesisBox::new(closing_parenthesis),
      }
      .into(),
  ))
}

Even unfinished code is parsed into a somewhat correct tree that contains error nodes when something is wrong:

parenthesized(0"(abc)") -> Some((""Parenthesized {
  opening_parenthesisOpeningParenthesis,
  innerIdentifier("abc"),
  closing_parenthesisClosingParenthesis,
}))

parenthesized(0"abc") -> None

parenthesized(0"()") -> Some((""Parenthesized {
  opening_parenthesisOpeningParenthesis,
  innerError(OpeningParenthesisMissesExpression),
  closing_parenthesisClosingParenthesis,
}))

parenthesized(0"(a") -> Some((""Parenthesized {
  opening_parenthesisOpeningParenthesis,
  innerIdentifier("a"),
  closing_parenthesisError(ParenthesisNotClosed),
}))

An interesting property about Candy in particular is that indentation is more important to the parser than any form of parentheses. This makes parsers even more complicated in some places, but it guarantees that invalid stuff in your code can only have local effects:

foo a =
  fds))){

# still parsed correctly
bar a =
  ...

As you might imagine, writing robust parsers takes some effort. Our complete parser file contains over 3000 lines (although many of those lines are also dedicated to unit tests). The result is worth it though – a robust parser that works even with unfinished code.

Abstract Syntax Tree (AST)

In the next stage, the Abstract Syntax Tree (AST), lots of information that's uninteresting for later stages is removed. In particular, all whitespace and comments are gone. That makes the next stages easier to implement.

assignmentstruct
  identifier $6@"int" = 
  call identifier $0@"use" with these arguments:
    text
      textPart $2@"Core"
assignmentstruct
  identifier $12@"add" = 
  identifier $10@"int"
assignment: $32@"inc" := function (fuzzable) { $33@"a" ->
  call identifier $16@"needs" with these arguments:
    call struct access identifier $18@"int".$20@"Is" with these arguments:
      identifier $22@"a"
  call identifier $26@"add" with these arguments:
    identifier $28@"a"
    int 1
}

As you see, some syntax peculiarieties are also desugured. In the CST, the inc a := ... definition was parsed as an assignment with a call on the left side. In the AST, it's instead converted to a definition of a function with an a parameter.

High-Level Intermediary Representation (HIR)

Deeply nested expressions are something computers don't natively understand. That's why we now turn expressions such as needs (int.is a), into a sequence of simple operations:

1. Look up the Is function from the int struct 2. Call the function 3. Call needs with the result

Because we need to introduce local temporary variables for the intermediary results, this is also a nice opportunity to track which variable names are in scope and relate those names to one another.

The compiler creates this flattened structure by going over the AST recursively and outputting finished expressions into a Context. For example, to compile the HIR code for a List, it first compiles the code for all the list items (which can be complicated expressions themselves) and then creates a list expression which references them.

impl Context<'_> {
  fn compile(&mut selfasts: &[Ast]) -> hir::Id { ... }

  // Adds generated code to `self`, returns the ID of the resulting expression.
  fn compile_single(&mut selfast: &Ast) -> hir::Id {
    match &ast.kind {
      AstKind::Int(Int(int)) => {
          self.push(Some(ast.id.clone()), Expression::Int(int.to_owned()), None)
      }
      AstKind::Identifier(Identifier(name)) => {
          let reference match self.identifiers.get(&name.value) {
              Some(reference) => reference.to_owned(),
              None => return self.push_error(...),
          };
          self.push(Some(ast.id.clone()), Expression::Reference(reference), None)
      }
      AstKind::List(List(items)) => {
          let hir_items items
              .iter()
              .map(|itemself.compile_single(item))
              .collect_vec();
          self.push(Some(ast.id.clone()), Expression::List(hir_items), None)
      }
      ...
  }
}

There are two more things to note:

This is the HIR for our file from above:

$use = { (non-fuzzable) $use:relativePath ->
  $use:importedModule relative to module Examples:playgrounduse $use:relativePath
}
$= $use
$builtinTextConcatenate
$builtinTypeOf
$= Text
$builtinEquals
$builtinIfElse
$builtinToDebugText
$"Core"
$call $with $7
$destructure $into [Int: p$0]
$int p$0
$10 = Nothing
$11 = $int
$12 destructure $11 into [Add: p$0]
$add p$0
$13 = Nothing
$inc = { (fuzzable) $inc:->
  $inc:"Builtins"
  $inc:call $use with $inc:0
  $inc:builtinStructGet
  $inc:= StructGet
  $inc:call $inc:with $inc:$inc:3
  $inc:= $int
  $inc:= Is
  $inc:call $inc:with $inc:$inc:6
  $inc:= $inc:a
  $inc:call $inc:with $inc:8
  $inc:10 "`int.is a` was not satisfied"
  $inc:11 needs $inc:with reason $inc:10
  $inc:12 = $add
  $inc:13 = $inc:a
  $inc:14 1
  $inc:15 call $inc:12 with $inc:13 $inc:14
}
$inc#1 = $inc
$14 = Inc
$15 = [$14: $inc#1]

Mid-Level Intermediate Representation (MIR)

You might have noted that needs are not normal calls in the HIR. That's because they are a feature unique to Candy in that they also attribute fault to a responsible piece of code when they fail. You can read the previous article for more information about why needs behave slightly different from functions.

In the MIR, the responsible pieces of code are handled explicitly – there's a new type of value, a HirId, which is passed around just like all the other values. Each function gets an additional parameter, the HirId of the code that's responsible for fulfilling the needs in the function.

With that in place, needs can now be converted to normal function calls.

# functions have an extra parameter for who's responsible for fulfilling needs
$= { $(+ responsible $2)
  ...
}

# A call such as `foo "hi"` becomes this:
$"hi"
$example:bar  # where foo is called
$call $$($responsible)

That also means that the compiler defines a needs function at the beginning of the MIR. Here's the MIR code for our example:

# anonymous:$generated::needs
$49 = { $$$(+ responsible $1) ->
  $anonymous:$generated::needs
  $builtinEquals
  $= Nothing
  $= True
  $= False
  $10 call $with $$($is responsible)
  $11 builtinIfElse
  # anonymous:$generated::needs:isConditionTrue:then
  $14 = { (responsible $12) ->
    $13 = $8
  }
  # anonymous:$generated::needs:isConditionTrue:else
  $17 = { (responsible $15) ->
    $16 call $with $$($is responsible)
  }
  $18 call $11 with $10 $14 $17 ($is responsible)
  $19 builtinIfElse
  # anonymous:$generated::needs:isConditionBool:then
  $22 = { (responsible $20) ->
    $21 = $7
  }
  # anonymous:$generated::needs:isConditionBool:else
  $26 = { (responsible $23) ->
    $24 "The `condition` must be either `True` or `False`."
    $25 panicking because $24 ($is at fault)
  }
  $27 call $19 with $18 $22 $26 ($is responsible)
  $28 builtinTypeOf
  $29 call $28 with $($is responsible)
  $30 = Text
  $31 call $with $29 $30 ($is responsible)
  $32 builtinIfElse
  # anonymous:$generated::needs:isReasonText:then
  $35 = { (responsible $33) ->
    $34 = $7
  }
  # anonymous:$generated::needs:isReasonText:else
  $39 = { (responsible $36) ->
    $37 "The `reason` must be a text."
    $38 panicking because $37 ($is at fault)
  }
  $40 call $32 with $31 $35 $39 ($is responsible)
  $41 builtinIfElse
  # anonymous:$generated::needs:condition:then
  $44 = { (responsible $42) ->
    $43 = $7
  }
  # anonymous:$generated::needs:condition:else
  $47 = { (responsible $45) ->
    $46 panicking because $($is at fault)
  }
  $48 call $41 with $$44 $47 ($is responsible)
}
$50 = Examples:playground:
# Examples:playground:use
$54 = { $52 (+ responsible $51) ->
  $53 use $52 (relative to Examples:playground; $50 is responsible)
}
...
# Examples:playground:inc
$270 = { $247 (+ responsible $246) ->
  $248 "Builtins"
  $249 = Examples:playground:inc:1
  $250 call $54 with $248 ($249 is responsible)
  $251 builtinStructGet
  $252 = StructGet
  $253 = Examples:playground:inc:4
  $254 call $251 with $250 $252 ($253 is responsible)
  $255 = $153
  $256 = Is
  $257 = Examples:playground:inc:7
  $258 call $254 with $255 $256 ($257 is responsible)
  $259 = $247
  $260 = Examples:playground:inc:9
  $261 call $258 with $259 ($260 is responsible)
  $262 "`int.is a` was not satisfied"
  $263 = Examples:playground:inc:11
  $264 call $49 with $261 $262 $246 ($263 is responsible)
  $265 = $244
  $266 = $247
  $267 1
  $268 = Examples:playground:inc:15
  $269 call $265 with $266 $267 ($268 is responsible)
}
$271 = $270
$272 = Inc
$273 = [$272: $271]

Also notable is that pattern destructuring gets converted into a bunch of ifElse calls, but I'll skip a more detailed explanation here.

Optimizing the MIR

Of course, code like this is horribly inefficient. That's why we also apply several optimizations on the MIR. These are the most important ones:

This is the MIR with optimizations applied (note there's still lots of optimization potential):

...
# anonymous:$generated::needs:isConditionTrue:then
$25 = { (responsible $26) ->
  $27 = $19
}
# anonymous:$generated::needs:isConditionBool:then
$28 = { (responsible $29) ->
  $30 = $17
}
# anonymous:$generated::needs
$31 = { $32 $33 $34 (+ responsible $35) ->
  $36 call $10 with $32 $19 ($is responsible)
  # anonymous:$generated::needs:isConditionTrue:else
  $37 = { (responsible $38) ->
    $39 call $10 with $32 $14 ($is responsible)
  }
  $40 call $11 with $36 $25 $37 ($is responsible)
  # anonymous:$generated::needs:isConditionBool:else
  $41 = { (responsible $42) ->
    $43 panicking because $21 ($35 is at fault)
  }
  $44 call $11 with $40 $28 $41 ($is responsible)
  $45 call $13 with $33 ($35 is responsible)
  $46 call $10 with $45 $18 ($35 is responsible)
  # anonymous:$generated::needs:isReasonText:else
  $47 = { (responsible $48) ->
    $49 panicking because $22 ($35 is at fault)
  }
  $50 call $11 with $46 $28 $47 ($is responsible)
  # anonymous:$generated::needs:condition:else
  $51 = { (responsible $52) ->
    $53 panicking because $33 ($34 is at fault)
  }
  $54 call $11 with $32 $28 $51 ($is responsible)
}
# Examples:playground:inc
$55 = { $56 (+ responsible $57) ->
  $58 call $13 with $56 ($is responsible)
  $59 call $10 with $58 $16 ($is responsible)
  $60 call $31 with $59 $24 $57 ($is responsible)
  $61 call $13 with $56 ($is responsible)
  $62 call $10 with $61 $16 ($is responsible)
  $63 call $31 with $62 $23 $($is responsible)
  $64 call $12 with $56 $20 ($is responsible)
}
$65 = [$15: $55]

Low-Level Intermediate Representation

While the MIR is already quite low-level, it still contains nested functions. Nested functions are necessary because they can close over (or capture) values from the outer scope. For example, check out the following function:

foo a = { -> int.add a b }

Here, foo 4 returns a function that always adds 4 to its argument. Thus, the a needs to continue existing even after foo returns.

In the next compiler stage, those captured variables are explicitly tracked. That also enables un-nesting functions.

Technically, functions also capture all used other functions. To make that easier, the next IR also tracks which values are constants – they don't need to be explicitly captured.

Candy uses reference counting for garbage collection. The operations for increasing and decreasing the reference cout of objects also also explicit in the LIR as dup and drop.

This is the LIR for our file:

# Constants
%= Builtins::equals:6
...
%13 = False
...
%27 = { body_}
%28 = [%14: %27]

# Bodies
...
body_$(+ responsible $1) =
  # Captured IDs: none
  $dup $by 2
  $= %12
  $= %5
  $call $with $($is responsible)
  $= %9
  $= %15
  $= %0
  $call $with $$($is responsible)
  $10 = %26
  $11 = %23
  $12 = %6
  $13 call $10 with $$11 $($12 is responsible)
  $14 = %3
  $15 call $with $($14 is responsible)
  $16 = %4
  $17 call $with $15 $($16 is responsible)
  $18 = %22
  $19 = %7
  $20 = %2
  $21 call $10 with $17 $18 $19 ($20 is responsible)
  $22 = %11
  $23 = %19
  $24 = %1
  $25 call $22 with $$23 ($24 is responsible)
  $26 = $25
  $27 drop $15
  $28 drop $26
...

Byte Code

Finally, the LIR is translated into Byte Code for the Candy VM.

Constant objects are actually created in memory.

Instead of functions with code, there only exists a single Vec of instructions. The VM maintains an instruction pointer, an index into that vector.

Because the VM is stack-based, there are lots of operations for pushing objects to and popping elements from the stack:

# Constant heap
0x55ed48b44300"Error"
...
0x55ed4686efc0"Main"
...
0x55ed48ed9120"`int.is a` was not satisfied"
...

# Instructions
...
# Examples:playground:inc
 73pushConstant inline builtinTypeOf
 74pushFromStack 2
 75pushConstant 0x55ed48be8e00 Builtins::typeOf:5
 76call with argument
 77pushConstant inline builtinEquals
 78pushFromStack 1
 79pushConstant inline Int
 ...
 97pushConstant 0x55ed473716d0 arguments (capturing nothing) → ip-26 }
 98pushFromStack 1
 99pushConstant 0x55ed4686b1a0 "`a | typeIs Int` was not satisfied"
100pushConstant 0x55ed46a39880 Examples:playground:inc:15
101pushConstant 0x55ed468a8f80 Builtins::intAdd:5
102call with arguments
103pushConstant inline builtinIntAdd
104pushFromStack 8
105pushConstant inline 1
106pushConstant 0x55ed468ac1b0 Builtins::intAdd:18
107tailCall with locals and arguments
...

You made it!