What We Learned Teaching AI a New Language

Science is more useful when you publish the surprises. Here are ours.

We ran Phase D — our primary LLM benchmark — across 10+ models ranging from 1B to 70B parameters, using two task sets: 9 standard Synoema tasks and a 50-task sample from our validated corpus. We tested three prompt configurations: a full baseline reference (~1800 tokens), a compact reference (~900 tokens), and a multipass self-correction approach. We ran 108+ attempts per configuration per model.

Not everything went as expected.

The Setup

Before the results: what we were measuring and how.

The run rate is our primary metric: what percentage of generated programs, when executed, produce the correct output? A program that parses and type-checks but produces wrong output counts as a failure. We report syntax% (programs that parse) and run% (programs that pass all tests) separately, because the gap between them is informative.
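The two metrics can be sketched as a small scoring function. This is an illustrative sketch, not the actual benchmark harness; the record fields `parsed` and `passed_tests` are hypothetical names for "the program parses" and "the program produces correct output on all tests".

```python
def score(attempts):
    """Return (syntax%, run%) for a list of attempt records.

    `parsed`       -> the generated program parses and type-checks
    `passed_tests` -> the program, when executed, produced correct output
    (field names are illustrative, not the benchmark's actual schema)
    """
    n = len(attempts)
    syntax = sum(1 for a in attempts if a["parsed"]) / n * 100
    run = sum(1 for a in attempts if a["passed_tests"]) / n * 100
    return syntax, run

attempts = [
    {"parsed": True,  "passed_tests": True},
    {"parsed": True,  "passed_tests": False},  # parses but wrong output: a failure
    {"parsed": False, "passed_tests": False},
]
print(score(attempts))  # roughly (66.7, 33.3) for this sample
```

The gap between the two numbers is exactly the "parses but wrong semantics" bucket that several results below turn on.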

The 9 standard tasks cover factorial, fibonacci, fizzbuzz, binary search, error handling, filter-map, quicksort, pattern matching, and type definition. They range from trivial (factorial is solved by every tested model at high reliability) to hard (quicksort and type-definition require understanding Synoema-specific syntax that has no direct Python/JS analogue).

Models were tested via Ollama (local) and OpenRouter (cloud). All results are stored in benchmarks/results/ and can be reproduced with the scripts in benchmarks/scripts/.

H1: Compact Reference Helps — DISPROVED

Our first hypothesis was that the 900-token compact reference would perform as well as the 1800-token baseline, or better, because it's shorter and less likely to dilute the relevant information in the model's context window.

It was wrong. Not marginally wrong — significantly wrong.

| Model | Task set | Baseline syntax% | Compact syntax% | Δ syntax | Baseline run% | Compact run% | Δ run |
|---|---|---|---|---|---|---|---|
| 3B | standard-9 | 87% | 50% | −37pp | 60% | 30% | −30pp |
| 7B | corpus-50 | 36% | 26% | −10pp | 12% | 12% | 0pp |

For the 3B model, the compact reference caused a 37-percentage-point drop in syntax correctness and a 30pp drop in run rate. The model with fewer prompt tokens produced dramatically worse code. For the 7B model, the run rate held flat, but syntax still dropped 10pp.

The interpretation that makes sense: the baseline reference provides richer syntactic context and more varied examples. Even though it's twice as long, it's not wasted context — the model actively uses the additional examples to calibrate its output. Counter-intuitively, more prompt tokens produced better code, at least at the prompt lengths we tested.

Practical implication: the compact reference is useful for cost reduction (fewer prompt tokens = cheaper API calls), but not for maximizing code quality. If you're optimizing for correctness, use the full baseline.
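The cost-versus-correctness tradeoff is worth making concrete. A back-of-envelope sketch, using the numbers from the table above (this is arithmetic on the published figures, not part of the benchmark):

```python
# Expected prompt tokens spent per *correct* program:
# prompt length divided by the probability an attempt runs correctly.

def tokens_per_success(prompt_tokens, run_rate):
    return prompt_tokens / run_rate

# 3B model: the halved prompt is exactly cancelled by the halved run rate.
baseline_3b = tokens_per_success(1800, 0.60)  # ~3000 tokens per correct program
compact_3b  = tokens_per_success(900, 0.30)   # ~3000 tokens per correct program

# 7B model: run rate holds flat, so compact really does halve the cost.
baseline_7b = tokens_per_success(1800, 0.12)  # ~15000
compact_7b  = tokens_per_success(900, 0.12)   # ~7500
```

So "compact for cost reduction" only pays off where the run rate survives the shorter prompt — it did at 7B, not at 3B.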

H2: Larger Models Are Better — CONFIRMED

This one was expected, but the magnitude surprised us.

Evidence A compared Llama 3.1-8B against Llama 3.3-70B on the same 9 standard tasks with the same prompt:

| Model | Size | Baseline syntax% | Baseline run% | Avg tokens out |
|---|---|---|---|---|
| llama-3.1-8b | 8B | 16% | 11% | 222 |
| llama-3.3-70b | 70B | 61% | 61% | 107 |

The 70B model achieves 61% run rate; the 8B model achieves 11%. That's a 50-percentage-point gap for a roughly 9x difference in parameter count. Within the same model family, on the same tasks, size has an enormous effect.

Evidence B ran 5 models across the 1B–8B range and measured the correlation between size and syntax rate:

| Model | Size | Syntax% | Run% |
|---|---|---|---|
| llama-3.2-1b | 1B | 0% | 0% |
| llama-3.2-3b | 3B | 44% | 0% |
| qwen2.5-coder-7b | 7B | 56% | 41% |
| qwen3-8b | 8B | 59% | 7% |
| llama-3.1-8b | 8B | 59% | 30% |

Spearman rank correlation between model size and syntax rate: ρ = 1.00. A perfect monotonic relationship: syntax rate never decreases with model size anywhere in the sample (the two 8B models tie at 59%).

Notice the 3B model: 44% syntax but 0% run. It generates code that parses but has wrong semantics. The 1B model generates nothing useful at all. The threshold for useful output appears to be somewhere between 3B and 7B for Synoema.

Also notice the run% variance at 8B (7%–41% depending on model family). Qwen3-8B achieves 59% syntax but only 7% run — it knows the surface syntax but gets the semantics wrong. More on this in H5.
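The ρ = 1.00 figure can be recomputed directly from the table. A self-contained sketch (ties, like the two 8B models, get average ranks, which is the standard Spearman treatment):

```python
def ranks(xs):
    """Rank values ascending; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                  # extend over the tied run
        avg = (i + j) / 2 + 1       # average rank for the tie
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

sizes  = [1, 3, 7, 8, 8]       # parameters, in billions
syntax = [0, 44, 56, 59, 59]   # syntax% from the Evidence B table
print(spearman(sizes, syntax))  # 1.0
```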

H3: Multipass Self-Correction — MODEL-SIZE-DEPENDENT

Multipass shows the model its own output and asks it to review and correct it. We expected this to help. For small models, it hurt significantly.

| Model | Baseline run% | Multipass run% | Δ |
|---|---|---|---|
| 3B | 60% | 28% | −32pp |
| 7B | 12% | 16% | +4pp |

For the 3B model, multipass makes things dramatically worse. The model, when asked to review its own output, introduces errors into working code — corrupting programs that had been correct on the first pass. Small models don't have enough capacity to self-critique reliably; they generate random edits instead.

For the 7B model, multipass provides a small positive signal (+4pp run rate), though with only one repeat, we can't be confident this isn't noise.

The 8B results (from Evidence B) are more interesting: Qwen3-8B with multipass achieves 48% run rate, up from 7% on baseline. This is a massive improvement: multipass corrects the semantic errors in code that the model already gets syntactically right on the first pass. For Qwen3-8B specifically, the self-correction mechanism works.

Implication: disable multipass for models ≤3B. Consider enabling for ≥7B, but measure per model — the effect is highly architecture-dependent.
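The multipass loop itself is simple enough to sketch. Here `generate` and `run_tests` are hypothetical stand-ins for the model call and the test harness, not the actual benchmark API; the early-exit guard is a design choice suggested by the 3B failure mode (don't let the model "fix" code that already works), and it assumes the harness can run tests between passes.

```python
def multipass(task, generate, run_tests, max_passes=2):
    """Generate, then show the model its own output and ask for a fix.

    `generate(prompt) -> str` and `run_tests(program) -> bool` are
    illustrative placeholders for a model call and a test harness.
    """
    program = generate(f"Write Synoema code for: {task}")
    for _ in range(max_passes - 1):
        if run_tests(program):
            break  # guard: never ask the model to "fix" working code
        program = generate(
            f"Task: {task}\n"
            f"Your previous attempt:\n{program}\n"
            "Review it and output a corrected version."
        )
    return program
```

Without that guard, a small model's second pass can corrupt a correct first pass — which is exactly the 3B result above.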

H4: Feature Difficulty — CONFIRMED

We hypothesized that difficulty would follow: fundamentals < data structures < applications < abstractions. The 50-task corpus test confirmed this, with a nuance.

| Category | Syntax% | Run% |
|---|---|---|
| fundamentals | 53% | 26% |
| data_structures | 26% | 13% |
| applications | 25% | 0% |
| abstractions | 37% | 0% |

The ordering holds: fundamentals is easiest, then data structures, then everything else. But applications and abstractions both achieve 0% run rate — not a smooth gradient but a sharp cliff. The model can't produce correct output for complex workflows or higher-order functions at all, on the 7B baseline.

The syntax/run gap for abstractions (37% syntax, 0% run) is particularly striking. The model generates syntactically valid Synoema code — it knows what pipe operators and lambdas look like — but the semantics are wrong. It's learned the shape without the meaning.

Per-task examples confirmed this pattern. The hardest tasks had specific characteristics: terse instructions without explicit call sites (model must infer the main expression), Synoema-specific operators without analogues in common languages (pipe, ternary chains), and pattern matching with literal patterns rather than guards.

H5: Architecture Matters — CONFIRMED

At similar parameter counts, architecture and training objective (code-specialized vs general) produce significantly different results.

In the 7–24B tier (same inference cost range), we tested four models:

| Model | Size | Type | Baseline run% |
|---|---|---|---|
| gemma-3-12b-it | 12B | General | 66% |
| mistral-small-24b | 24B | General | 50% |
| llama-3.1-8b | 8B | General | 11% |
| qwen2.5-coder-7b | 7B | Code-specialized | 41% |

Gemma-3-12B achieves 66% run rate — the best result in this tier, better than the 24B Mistral model despite having half the parameters. Architecture matters more than size within this range.

The code-specialized Qwen2.5-Coder-7B achieves 41% run rate on baseline, better than the general Llama-3.1-8B (30% in the Evidence B run) despite fewer parameters. Code specialization provides a meaningful advantage on Synoema generation — plausibly because the training distribution included more structured, typed code that shares patterns with Synoema's syntax.

Practical recommendation: in the 7–24B tier, prefer Gemma-3-12B. For cost-constrained deployment at 7B, prefer Qwen2.5-Coder-7B over general-purpose alternatives.

The Per-Task Breakdown

The aggregate numbers obscure important variation. Looking at individual tasks on the 3B model (baseline):

| Task | Syntax% | Run% | Note |
|---|---|---|---|
| factorial | 100% | 100% | Trivial recursion, solved by all models |
| fizzbuzz | 100% | 100% | Conditional + string, well-represented in training |
| fibonacci | 91% | 91% | Near-perfect on small models |
| binary_search | 100% | 75% | Logic errors on edge cases |
| filter_map | 91% | 75% | List comprehension works, complex composition fails |
| quicksort | 16% | 16% | Divide-and-conquer fails on small models |
| pattern_match | 91% | 0% | Syntax ok, semantic failure on ADT matching |
| type_definition | 91% | 0% | Syntax ok, logic wrong |

The pattern_match and type_definition rows are the most interesting. 91% syntax success means the model can write Synoema syntax for these patterns — it knows what ADT definitions and pattern matching look like. But 0% run rate means it consistently gets the semantics wrong. It's producing Synoema-shaped code that doesn't do the right thing.

This is a qualitatively different failure mode from quicksort (16% syntax), where the model fails to produce the language syntax at all. For ADT tasks, the model has learned the form but not the content. Fine-tuning on correct examples for these specific patterns should help — that's what the corpus and fine-tuning experiment are designed to test.
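The two failure modes can be separated mechanically. A sketch, again with illustrative field names rather than the benchmark's actual schema:

```python
from collections import Counter

def failure_mode(attempt):
    """Classify one attempt into the failure modes discussed above."""
    if not attempt["parsed"]:
        return "syntax_fail"    # quicksort-style: can't produce the language at all
    if not attempt["passed_tests"]:
        return "semantic_fail"  # pattern_match-style: right shape, wrong meaning
    return "pass"

def mode_counts(attempts):
    return Counter(failure_mode(a) for a in attempts)
```

For an ADT task like pattern_match, this classifier would put nearly every attempt in the `semantic_fail` bucket — the signature of "learned the form but not the content".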

What This Means for Fine-Tuning

The Phase D results shape our fine-tuning strategy directly:

Target the 0%-run categories. Applications and abstractions reach 0% run rate on the baseline 7B model. The corpus specifically includes these categories, and fine-tuning with correct examples in these categories is the intervention most likely to close the gap.

Use the baseline prompt, not compact. H1 is disproved — compact hurts. Our fine-tuning evaluation uses the full 1800-token baseline reference.

Prefer code-specialized base models. H5 is confirmed — starting from Qwen2.5-Coder for fine-tuning gives a better starting point than a general-purpose model of the same size.

Measure per category, not just aggregate. Aggregate run rate conceals the fact that some categories are solved (factorial: 100%) and others are completely open (abstractions: 0%). Category-level metrics are more actionable.
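Category-level aggregation is a one-pass fold over the results. A minimal sketch, assuming each result record carries a `category` field and a `passed_tests` boolean (illustrative names, not the stored schema in benchmarks/results/):

```python
from collections import defaultdict

def run_rate_by_category(results):
    """Map category -> run% . Record fields are illustrative."""
    buckets = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for r in results:
        b = buckets[r["category"]]
        b[0] += bool(r["passed_tests"])
        b[1] += 1
    return {cat: 100.0 * p / n for cat, (p, n) in buckets.items()}
```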

The fine-tuning benchmarks are pending as of this writing. We'll report the results — including the failures — in a follow-up article.

Related Articles

The Scientific Method Behind Synoema

The full methodology: 12 falsifiable hypotheses, statistical tests, corpus design, and reproducibility.

From Zero to 41%: Building an AI That Writes Working Code

The narrative behind the numbers: corpus generation, fine-tuning, and what 59% failure rate means.

Why AI Writes Broken Code — and How Type Systems Can Fix It

The types of errors LLMs make and the structural interventions that address them.