Machine Learning One

Making WAT Livable: Macros, String Inlining, and Local Hoisting

Extending WAT with higher-level language features through a multipass preprocessor.

In the previous article, we felt the pain of raw WAT: 5 lines for every argument load, no inline strings, locals forced to the function top. This article builds the cure — a three-pass preprocessor that transforms an extended WAT dialect into standard WAT.

The Three Passes

The preprocessor takes a WAT function body and produces:

  1. A transformed body (standard WAT)
  2. A list of data sections (static data to write into memory before execution)
  3. An initial memory top (past the static data zone)

pub struct PreprocessResult {
    pub body: String,
    pub data_sections: Vec<DataSection>,
    pub initial_top: u32,
}

pub struct DataSection {
    pub offset: u32,
    pub bytes: Vec<u8>,
}

The three passes are:

  • Pass 0: Macro expansion: (argv N $var), (check $err), (resv $ptr) → standard WAT
  • Pass 1: String inlining: "hello" and """multi-line""" → (i32.const offset) + data sections
  • Pass 2: Local hoisting: (local ...) declarations moved to function top, deduped

Pass 0: Macro Expansion

The tree-sitter-wat-plus Grammar

The preprocessor needs to parse WAT-plus source into an AST — standard WAT plus our macro extensions. We started from the existing tree-sitter-wasm grammar, which provides a full tree-sitter grammar for the WebAssembly text format. We then stripped out everything irrelevant to our i32-only subset (floating-point instructions, SIMD, reference types, table operations) and extended the grammar with three new node types: argv_macro, check_macro, and resv_macro. The result is tree-sitter-wat-plus, a minimal grammar that parses exactly the language our LLM generates.

Starting from an established WAT grammar instead of writing one from scratch was important: tree-sitter grammars for S-expression languages have subtle corner cases around nested parentheses, comments, and string escaping. The tree-sitter-wasm grammar (maintained by the wasm-lsp project) had already solved these.

The preprocessor uses this grammar to parse the body into an AST, walks the AST to find macro nodes, expands each one, and splices the expansions into the source text.

The argv Macro

Input:

(argv 0 $url_ptr)

Expansion:

(local $url_ptr i32)
(local $url_err i32)
(call $sys.argv (i32.const 0))
(local.set $url_err)
(local.set $url_ptr)
(if (i32.ne (local.get $url_err) (i32.const 0))
    (then (return (local.get $url_err)))
)

One line becomes eight. The error local name is derived automatically: $url_ptr → $url_err (strip the _ptr suffix if present, then append _err):

fn derive_err_name(name: &str) -> String {
    let base = name.strip_prefix('$').unwrap_or(name);
    let base = base.strip_suffix("_ptr").unwrap_or(base);
    format!("${}_err", base)
}

So $query_ptr → $query_err, $seed → $seed_err, and $body_ptr → $body_err.
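
As an illustration, the expansion itself can be produced with plain string formatting. The sketch below is an assumption about how such a formatter might look, not the preprocessor's actual code: expand_argv is a hypothetical name, and derive_err_name is repeated only so the block compiles on its own.

```rust
/// Same helper as in the article: "$url_ptr" -> "$url_err".
fn derive_err_name(name: &str) -> String {
    let base = name.strip_prefix('$').unwrap_or(name);
    let base = base.strip_suffix("_ptr").unwrap_or(base);
    format!("${}_err", base)
}

/// Hypothetical formatter for `(argv N $var)`; the real expander may differ.
fn expand_argv(index: u32, var: &str) -> String {
    let err = derive_err_name(var);
    [
        // Declarations are emitted in place; Pass 2 hoists them later.
        format!("(local {var} i32)"),
        format!("(local {err} i32)"),
        // Same call and local.set order as the expansion shown above.
        format!("(call $sys.argv (i32.const {index}))"),
        format!("(local.set {err})"),
        format!("(local.set {var})"),
        // Error guard: return the error code if it is non-zero.
        format!("(if (i32.ne (local.get {err}) (i32.const 0))"),
        format!("    (then (return (local.get {err})))"),
        ")".to_string(),
    ]
    .join("\n")
}
```

Calling expand_argv(0, "$url_ptr") yields the eight-line expansion shown above, with the (local ...) declarations left in place for Pass 2 to hoist.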

The check Macro

Input:

(check $body_err)

Expansion:

(if (i32.ne (local.get $body_err) (i32.const 0))
    (then (return (local.get $body_err)))
)

One line becomes three. This is the error guard pattern — if the error code is non-zero, return it immediately.

The resv Macro

Input:

(resv $result_ptr)

Expansion:

(call $sys.resv (local.get $result_ptr))
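
The two simpler macros need no name derivation, so their expansions are single format calls. A minimal sketch, assuming hypothetical helper names expand_check and expand_resv (the article does not show the real ones):

```rust
/// Hypothetical formatter for `(check $err)`: the three-line error guard.
fn expand_check(err: &str) -> String {
    format!(
        "(if (i32.ne (local.get {err}) (i32.const 0))\n    (then (return (local.get {err})))\n)"
    )
}

/// Hypothetical formatter for `(resv $ptr)`: a single call to $sys.resv.
fn expand_resv(ptr: &str) -> String {
    format!("(call $sys.resv (local.get {ptr}))")
}
```

Neither macro declares a local, so the variable passed to check or resv must already be declared elsewhere in the body.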

Implementation

The macro expander walks the tree-sitter AST with an iterative DFS, collecting MacroHit structs:

struct MacroHit {
    start: usize,   // byte offset in source
    end: usize,     // byte offset in source
    expansion: String,
}

For each macro node, it extracts the arguments (index and variable name for argv, variable name for check/resv) and formats the expansion string. After collecting all hits, it sorts them by descending start position and splices the expansions back to front, so that replacing a later span never shifts the byte offsets of the spans still to be processed:

hits.sort_by(|a, b| b.start.cmp(&a.start));
let mut result = source.to_vec();
for hit in &hits {
    result.splice(hit.start..hit.end, hit.expansion.bytes());
}

If no macros are found, the function returns None (fast path — avoids re-parsing overhead).
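
To see the back-to-front splice and the None fast path together, here is a self-contained sketch. The function name splice_hits and its signature are assumptions; in the real expander the hits come from the tree-sitter walk rather than being built by hand.

```rust
struct MacroHit {
    start: usize, // byte offset in source
    end: usize,   // byte offset in source (exclusive)
    expansion: String,
}

/// Apply all expansions to `source`. Returns None when there is nothing to
/// expand, so the caller can keep using the original text unchanged.
fn splice_hits(source: &[u8], mut hits: Vec<MacroHit>) -> Option<Vec<u8>> {
    if hits.is_empty() {
        return None;
    }
    // Highest offset first: every splice only moves bytes *after* its span,
    // so the offsets of the remaining (earlier) hits stay valid.
    hits.sort_by(|a, b| b.start.cmp(&a.start));
    let mut result = source.to_vec();
    for hit in &hits {
        result.splice(hit.start..hit.end, hit.expansion.bytes());
    }
    Some(result)
}
```

If the hits were applied front to back instead, the first replacement would change the length of the buffer and every later start and end would point at the wrong bytes. The sketch also assumes that hits never overlap, which holds as long as macros are not nested inside each other.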

Pass 1: String Inlining

After macro expansion, the preprocessor re-parses with tree-sitter and walks the AST again, this time collecting string_literal nodes.

String Formats

Two formats are supported:

Single-line strings: "hello world" — standard WAT escape sequences apply:

  • \n, \t, \r — whitespace escapes
  • \", \', \\ — literal characters
  • \HH — two hex digits → one byte
  • \u{hex...} — Unicode codepoint (UTF-8 encoded)

Triple-quoted strings: """multi-line text""" — raw strings with leading/trailing newline stripped. No escape processing.
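
The escape rules above can be sketched as a small decoder. This is an illustrative implementation under stated assumptions, not the preprocessor's own code: decode_escapes and strip_triple_quoted are hypothetical names, the input is the text between the quotes, and exactly one leading and one trailing newline are assumed to be stripped from triple-quoted strings.

```rust
/// Decode the contents of a single-line string (quotes already removed).
fn decode_escapes(raw: &str) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Ordinary character: copy its UTF-8 bytes.
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            continue;
        }
        match chars.next().ok_or("dangling backslash")? {
            'n' => out.push(b'\n'),
            't' => out.push(b'\t'),
            'r' => out.push(b'\r'),
            '"' => out.push(b'"'),
            '\'' => out.push(b'\''),
            '\\' => out.push(b'\\'),
            'u' => {
                // \u{hex...}: a Unicode code point, stored as UTF-8.
                if chars.next() != Some('{') {
                    return Err("expected '{' after \\u".to_string());
                }
                let mut hex = String::new();
                loop {
                    match chars.next() {
                        Some('}') => break,
                        Some(h) => hex.push(h),
                        None => return Err("unterminated \\u{...}".to_string()),
                    }
                }
                let code = u32::from_str_radix(&hex, 16).map_err(|e| e.to_string())?;
                let ch = char::from_u32(code).ok_or("invalid code point")?;
                let mut buf = [0u8; 4];
                out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
            }
            hi => {
                // \HH: two hex digits produce exactly one raw byte.
                let lo = chars.next().ok_or("truncated \\HH escape")?;
                let byte = hi
                    .to_digit(16)
                    .zip(lo.to_digit(16))
                    .map(|(h, l)| (h * 16 + l) as u8)
                    .ok_or("invalid escape sequence")?;
                out.push(byte);
            }
        }
    }
    Ok(out)
}

/// Triple-quoted strings: no escape processing, only newline stripping.
fn strip_triple_quoted(raw: &str) -> &str {
    let s = raw.strip_prefix('\n').unwrap_or(raw);
    s.strip_suffix('\n').unwrap_or(s)
}
```

The decoder returns bytes rather than a String because \HH can produce byte sequences that are not valid UTF-8, which is also why the deduplication map below is keyed on Vec<u8>.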

Deduplication

If the same string appears multiple times, only one copy is stored in memory. Deduplication uses an insertion-ordered vector as a map: each lookup is a linear scan, which stays cheap because the number of distinct strings (N) is small:

let mut string_map: Vec<(Vec<u8>, u32)> = Vec::new();
let mut current_offset: u32 = 0;

for hit in &string_hits {
    if !string_map.iter().any(|(k, _)| k == &hit.decoded) {
        let entry_size = align_up(4 + hit.decoded.len() as u32, 4);
        string_map.push((hit.decoded.clone(), current_offset));
        current_offset += entry_size;
    }
}
let initial_top = current_offset;

Memory Layout

Each string entry is 4-byte aligned:

[u32 LE length][content bytes][zero padding to 4-byte boundary]

Example: "hello" (5 bytes) → [05 00 00 00][68 65 6c 6c 6f][00 00 00] = 12 bytes.

fn align_up(x: u32, align: u32) -> u32 {
    (x + align - 1) & !(align - 1)
}

fn build_data_section(offset: u32, content: &[u8]) -> DataSection {
    let len = content.len() as u32;
    let entry_size = align_up(4 + len, 4) as usize;
    let mut bytes = Vec::with_capacity(entry_size);
    bytes.extend_from_slice(&len.to_le_bytes());
    bytes.extend_from_slice(content);
    while bytes.len() < entry_size {
        bytes.push(0);
    }
    DataSection { offset, bytes }
}
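
The length prefix is what lets a consumer recover the exact string from nothing but the offset in (i32.const <offset>). The round trip can be sketched as follows; encode_entry condenses the layout logic above, and read_entry is a hypothetical reader, not a function from the runtime.

```rust
/// Round a size up to the next multiple of `align` (a power of two).
fn align_up(x: u32, align: u32) -> u32 {
    (x + align - 1) & !(align - 1)
}

/// Encode one entry: [u32 LE length][content bytes][zero padding].
fn encode_entry(content: &[u8]) -> Vec<u8> {
    let len = content.len() as u32;
    let mut bytes = len.to_le_bytes().to_vec();
    bytes.extend_from_slice(content);
    bytes.resize(align_up(4 + len, 4) as usize, 0);
    bytes
}

/// Hypothetical reader: given a memory image and an entry offset, return
/// the string content without the padding. None if the entry is out of bounds.
fn read_entry(memory: &[u8], offset: u32) -> Option<&[u8]> {
    let start = offset as usize;
    let header = memory.get(start..start.checked_add(4)?)?;
    let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let body_start = start + 4;
    memory.get(body_start..body_start.checked_add(len)?)
}
```

Placing "hello" and then "hi" back to back gives entries at offsets 0 and 12 and a total of 20 bytes, which is the value initial_top would take for those two strings.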

Replacement

Each string literal in the source is replaced with (i32.const <offset>):

for hit in &string_hits {
    let offset = string_map.iter()
        .find(|(k, _)| k == &hit.decoded)
        .map(|(_, o)| *o)
        .unwrap_or(0);
    edits.push(Edit {
        start: hit.start,
        end: hit.end,
        replacement: format!("(i32.const {})", offset),
    });
}

Before: (call $http.get "https://example.com")
After: (call $http.get (i32.const 0))
Data section at offset 0: [13 00 00 00]https://example.com[00] (4-byte length prefix 0x13 = 19, then 19 content bytes, then 1 padding byte = 24 bytes)

The data sections are written into linear memory by LinkedModule.instantiate() before the program runs. The bump allocator's initial_top is set past the static data zone, so dynamic allocations don't overwrite the strings.
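
The interaction between the static data zone and the allocator can be modeled without a WebAssembly runtime. In the sketch below, memory is a plain byte slice standing in for linear memory, and both write_sections and the Bump type are hypothetical stand-ins for what LinkedModule.instantiate() and the runtime's allocator do; their real interfaces are not shown in this article.

```rust
pub struct DataSection {
    pub offset: u32,
    pub bytes: Vec<u8>,
}

/// Copy each section into the memory image; fail if a section does not fit.
fn write_sections(memory: &mut [u8], sections: &[DataSection]) -> Result<(), String> {
    for s in sections {
        let start = s.offset as usize;
        let end = start + s.bytes.len();
        memory
            .get_mut(start..end)
            .ok_or_else(|| format!("section at offset {} overflows memory", s.offset))?
            .copy_from_slice(&s.bytes);
    }
    Ok(())
}

/// Hypothetical bump allocator whose first allocation starts at initial_top.
struct Bump {
    top: u32,
    limit: u32,
}

impl Bump {
    fn new(initial_top: u32, limit: u32) -> Self {
        Bump { top: initial_top, limit }
    }

    /// Reserve `size` bytes and return the start offset, or None when full.
    fn alloc(&mut self, size: u32) -> Option<u32> {
        let start = self.top;
        // Round the new top up so the next allocation stays 4-byte aligned.
        let end = start.checked_add(size)?.checked_add(3)? & !3;
        if end > self.limit {
            return None;
        }
        self.top = end;
        Some(start)
    }
}
```

Because the allocator never hands out an offset below initial_top, a dynamic allocation cannot land on top of an inlined string, whatever the allocation sizes are.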

Pass 2: Local Hoisting

The final pass collects all (local ...) declarations from the body, removes them from their original positions, and places them at the function top.

Deduplication Rules

  • Named locals ((local $x i32)): first occurrence kept, duplicates removed
  • Anonymous locals ((local i32)): all kept (they represent distinct stack slots)

let mut seen_names: Vec<String> = Vec::new();
let mut unique_locals: Vec<String> = Vec::new();

for hit in &local_hits {
    locals_to_remove.push((hit.start, hit.end));
    match &hit.name {
        Some(name) => {
            if !seen_names.contains(name) {
                seen_names.push(name.clone());
                unique_locals.push(hit.text.trim().to_string());
            }
        }
        None => {
            unique_locals.push(hit.text.trim().to_string());
        }
    }
}

This is important because the argv macro generates (local $name i32) declarations. If the LLM also declares (local $name i32) manually, we get a duplicate. WAT doesn't allow duplicate named locals, so deduplication prevents compilation errors.
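
The argv collision can be reproduced with a text-only sketch of the same rules. It is an approximation: the real pass reads names from tree-sitter nodes, whereas the hypothetical local_name helper below pulls the name out of the declaration text and assumes the simple (local $name type) and (local type) shapes.

```rust
/// Extract the `$name` of a local declaration, or None for anonymous locals.
fn local_name(decl: &str) -> Option<&str> {
    decl.trim()
        .strip_prefix("(local")?
        .split_whitespace()
        .next()
        .filter(|tok| tok.starts_with('$'))
}

/// Keep the first occurrence of each named local and every anonymous local.
fn dedup_locals<'a>(decls: &[&'a str]) -> Vec<&'a str> {
    let mut seen: Vec<&str> = Vec::new();
    let mut kept = Vec::new();
    for &decl in decls {
        match local_name(decl) {
            // A later declaration of an already-seen name is dropped.
            Some(name) if seen.contains(&name) => {}
            Some(name) => {
                seen.push(name);
                kept.push(decl.trim());
            }
            // Anonymous locals are distinct slots, so all of them are kept.
            None => kept.push(decl.trim()),
        }
    }
    kept
}
```

Given the two declarations generated by (argv 0 $url_ptr) followed by a manual (local $url_ptr i32), only the first $url_ptr declaration survives, so the assembled function declares the name once.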

Applying All Edits

Passes 1 and 2 generate edits simultaneously. All edits (string replacements + local removals) are sorted by descending start position and applied back-to-front:

edits.sort_by(|a, b| b.start.cmp(&a.start));
let mut result_bytes = source.to_vec();
for edit in &edits {
    result_bytes.splice(edit.start..edit.end, edit.replacement.bytes());
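}
// Decode the spliced bytes back into text (assumes the body is valid UTF-8).
let cleaned_body = String::from_utf8_lossy(&result_bytes);

let final_body = if !unique_locals.is_empty() {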
    let preamble = unique_locals.join("\n");
    format!("{}\n{}", preamble, cleaned_body)
} else {
    cleaned_body.to_string()
};

Before and After

The ugly KV program from Article 7 (32 lines of code):

(local $key_ptr i32)
(local $key_err i32)
(local $val_ptr i32)
(local $val_err i32)
(local $set_err i32)
(local $get_ptr i32)
(local $get_err i32)

(call $sys.argv (i32.const 0))
(local.set $key_err)
(local.set $key_ptr)
(if (i32.ne (local.get $key_err) (i32.const 0))
    (then (return (local.get $key_err)))
)
(call $sys.argv (i32.const 1))
(local.set $val_err)
(local.set $val_ptr)
(if (i32.ne (local.get $val_err) (i32.const 0))
    (then (return (local.get $val_err)))
)
(call $kv.set (local.get $key_ptr) (local.get $val_ptr))
(local.set $set_err)
(if (i32.ne (local.get $set_err) (i32.const 0))
    (then (return (local.get $set_err)))
)
(call $kv.get (local.get $key_ptr))
(local.set $get_err)
(local.set $get_ptr)
(if (i32.ne (local.get $get_err) (i32.const 0))
    (then (return (local.get $get_err)))
)
(call $sys.resv (local.get $get_ptr))
(i32.const 0)

With the preprocessor (13 lines):

(argv 0 $key_ptr)
(argv 1 $val_ptr)
(call $kv.set (local.get $key_ptr) (local.get $val_ptr))
(local $set_err i32)
(local.set $set_err)
(check $set_err)
(call $kv.get (local.get $key_ptr))
(local $get_err i32) (local $get_ptr i32)
(local.set $get_err)
(local.set $get_ptr)
(check $get_err)
(resv $get_ptr)
(i32.const 0)

The fetch-and-summarize program that was impossible in raw WAT:

(argv 0 $url_ptr)
(call $http.get (local.get $url_ptr))
(local $body_err i32) (local $body_ptr i32)
(local.set $body_err)
(local.set $body_ptr)
(check $body_err)
(call $ai.assist "Summarize in one paragraph." (local.get $body_ptr) (call $sys.nil))
(local $sum_err i32) (local $sum_ptr i32)
(local.set $sum_err)
(local.set $sum_ptr)
(check $sum_err)
(resv $sum_ptr)
(i32.const 0)

The string "Summarize in one paragraph." is automatically inlined into memory. The (argv 0 $url_ptr) macro handles all the argument loading boilerplate. The (local ...) declarations appear next to the code that uses them — the preprocessor hoists them to the top automatically.

Tests

The preprocessor has comprehensive tests. Here are the key cases:

#[test]
fn single_string_literal() {
    let body = r#"(call $f "hello")"#;
    let result = preprocess(body).unwrap();
    assert_eq!(result.initial_top, 12); // align_up(4+5, 4) = 12
    assert_eq!(result.data_sections.len(), 1);
    assert!(result.body.contains("(i32.const 0)"));
}

#[test]
fn duplicate_string_literals_dedup() {
    let body = r#"(call $f "hello")
(call $g "hello")"#;
    let result = preprocess(body).unwrap();
    assert_eq!(result.data_sections.len(), 1); // Only one copy
    let count = result.body.matches("(i32.const 0)").count();
    assert_eq!(count, 2); // Both point to same offset
}

#[test]
fn argv_macro_basic() {
    let result = preprocess("(argv 0 $query_ptr)").unwrap();
    assert!(result.body.contains("(local $query_ptr i32)"));
    assert!(result.body.contains("(local $query_err i32)"));
    assert!(result.body.contains("(call $sys.argv (i32.const 0))"));
}

#[test]
fn mid_body_local_hoisted() {
    let body = "(call $sys.argv (i32.const 0))\n(local $x i32)\n(local.set $x)";
    let result = preprocess(body).unwrap();
    assert!(result.body.starts_with("(local $x i32)"));
}

#[test]
fn duplicate_named_local_deduped() {
    let body = "(local $x i32)\n(call $f)\n(local $x i32)";
    let result = preprocess(body).unwrap();
    let count = result.body.matches("(local $x i32)").count();
    assert_eq!(count, 1);
}

Run the full test suite with cargo test -p rt.

How the Template Uses the Preprocessor

The Template.assemble() method calls the preprocessor first, then wraps the result in a complete module:

pub fn assemble(&self, body: &str) -> Result<AssembleResult> {
    let pre = crate::preprocessor::preprocess(body)?;

    let mut wat = String::from("(module\n");
    for import in &self.wat_imports {
        wat.push_str(import);
        wat.push('\n');
    }
    wat.push_str("    (memory $mem.tape 1)\n");
    wat.push_str("    (export \"mem.tape\" (memory $mem.tape))\n\n");
    wat.push_str("    (func $run (result i32)\n");
    wat.push_str(&pre.body);
    wat.push_str("\n    )\n");
    wat.push_str("    (export \"run\" (func $run))\n");
    wat.push_str(")\n");

    Ok(AssembleResult {
        wat,
        data_sections: pre.data_sections,
        initial_top: pre.initial_top,
    })
}

The data sections and initial_top are passed through to LinkedModule.instantiate(), which writes the static data into memory and sets the bump allocator's starting offset.

The Full Processing Pipeline

LLM generates WAT-plus body
        │
        ▼
   Pass 0: expand_macros()
   (argv N $var) → full argv sequence
   (check $err)  → error guard
   (resv $ptr)   → sys.resv call
        │
        ▼
   Pass 1: String inlining
   "hello" → (i32.const 0)
   Data sections: [{offset: 0, bytes: [5,0,0,0,h,e,l,l,o,0,0,0]}]
        │
        ▼
   Pass 2: Local hoisting
   Move (local ...) to function top
   Deduplicate named locals
        │
        ▼
   Template.assemble()
   Wrap in (module ... imports ... (func $run ...) ...)
        │
        ▼
   Engine.compile_wat()
   Compile to wasmtime::Module
        │
        ▼
   LinkedModule.instantiate()
   Write data sections → set initial_top → create Instance
        │
        ▼
   Instance.run()

With the preprocessor in place, the LLM can generate concise, readable WAT-plus code, and the runtime handles all the mechanical transformations.
