What We Learned Teaching AI a New Language

Science is more useful when you publish the surprises. Here are ours.

We ran Phase D — our primary LLM benchmark — across 10+ models ranging from 1B to 70B parameters, using two task sets: 9 standard Synoema tasks and a 50-task sample from our validated corpus. We tested three prompt configurations: a full baseline reference (~1800 tokens), a compact reference (~900 tokens), and a multipass self-correction approach. We ran 108+ attempts per configuration per model.

Not everything went as expected.

The Setup

Before the results: what we were measuring and how.

The run rate is our primary metric: what percentage of generated programs, when executed, produce the correct output? A program that parses and type-checks but produces wrong output counts as a failure. We report syntax% (programs that parse) and run% (programs that pass all tests) separately, because the gap between them is informative.
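The two metrics can be sketched as a small scoring function. This is an illustrative sketch, not the actual benchmark harness; the record fields `parsed` and `passed_tests` are hypothetical names for "the program parses" and "the program produces correct output on all tests".

```python
def score(attempts):
    """Return (syntax%, run%) for a list of attempt records.

    `parsed`       -> the generated program parses and type-checks
    `passed_tests` -> the program, when executed, produced correct output
    (field names are illustrative, not the benchmark's actual schema)
    """
    n = len(attempts)
    syntax = sum(1 for a in attempts if a["parsed"]) / n * 100
    run = sum(1 for a in attempts if a["passed_tests"]) / n * 100
    return syntax, run

attempts = [
    {"parsed": True,  "passed_tests": True},
    {"parsed": True,  "passed_tests": False},  # parses but wrong output: a failure
    {"parsed": False, "passed_tests": False},
]
print(score(attempts))  # roughly (66.7, 33.3) for this sample
```

The gap between the two numbers is exactly the "parses but wrong semantics" bucket that several results below turn on.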

The 9 standard tasks cover factorial, fibonacci, fizzbuzz, binary search, error handling, filter-map, quicksort, pattern matching, and type definition. They range from trivial (factorial is solved by every tested model at high reliability) to hard (quicksort and type-definition require understanding Synoema-specific syntax that has no direct Python/JS analogue).

Models were tested via Ollama (local) and OpenRouter (cloud). All results are stored in benchmarks/results/ and can be reproduced with the scripts in benchmarks/scripts/.

H1: Compact Reference Helps — DISPROVED

Our first hypothesis was that the 900-token compact reference would perform as well as the 1800-token baseline, or better, because it's shorter and less likely to dilute the relevant information in the model's context window.

It was wrong. Not marginally wrong — significantly wrong.

| Model | Task set | Baseline syntax% | Compact syntax% | Δ syntax | Baseline run% | Compact run% | Δ run |
|---|---|---|---|---|---|---|---|
| 3B | standard-9 | 87% | 50% | −37pp | 60% | 30% | −30pp |
| 7B | corpus-50 | 36% | 26% | −10pp | 12% | 12% | 0pp |

For the 3B model, the compact reference caused a 37-percentage-point drop in syntax correctness and a 30pp drop in run rate. The model with fewer prompt tokens produced dramatically worse code. For the 7B model, the run rate held flat, but syntax still dropped 10pp.

The interpretation that makes sense: the baseline reference provides richer syntactic context and more varied examples. Even though it's twice as long, it's not wasted context — the model actively uses the additional examples to calibrate its output. Counter-intuitively, more prompt tokens produced better code, at least at the prompt lengths we tested.

Practical implication: the compact reference is useful for cost reduction (fewer prompt tokens = cheaper API calls), but not for maximizing code quality. If you're optimizing for correctness, use the full baseline.
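The cost-versus-correctness tradeoff is worth making concrete. A back-of-envelope sketch, using the numbers from the table above (this is arithmetic on the published figures, not part of the benchmark):

```python
# Expected prompt tokens spent per *correct* program:
# prompt length divided by the probability an attempt runs correctly.

def tokens_per_success(prompt_tokens, run_rate):
    return prompt_tokens / run_rate

# 3B model: the halved prompt is exactly cancelled by the halved run rate.
baseline_3b = tokens_per_success(1800, 0.60)  # ~3000 tokens per correct program
compact_3b  = tokens_per_success(900, 0.30)   # ~3000 tokens per correct program

# 7B model: run rate holds flat, so compact really does halve the cost.
baseline_7b = tokens_per_success(1800, 0.12)  # ~15000
compact_7b  = tokens_per_success(900, 0.12)   # ~7500
```

So "compact for cost reduction" only pays off where the run rate survives the shorter prompt — it did at 7B, not at 3B.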

H2: Larger Models Are Better — CONFIRMED

This one was expected, but the magnitude surprised us.

Evidence A compared Llama 3.1-8B against Llama 3.3-70B on the same 9 standard tasks with the same prompt:

| Model | Size | Baseline syntax% | Baseline run% | Avg tokens out |
|---|---|---|---|---|
| llama-3.1-8b | 8B | 16% | 11% | 222 |
| llama-3.3-70b | 70B | 61% | 61% | 107 |

The 70B model achieves 61% run rate; the 8B model achieves 11%. That's a 50-percentage-point gap for a roughly 9x difference in parameter count. Within the same model family, on the same tasks, size has an enormous effect.

Evidence B ran 5 models across the 1B–8B range and measured the correlation between size and syntax rate:

| Model | Size | Syntax% | Run% |
|---|---|---|---|
| llama-3.2-1b | 1B | 0% | 0% |
| llama-3.2-3b | 3B | 44% | 0% |
| qwen2.5-coder-7b | 7B | 56% | 41% |
| qwen3-8b | 8B | 59% | 7% |
| llama-3.1-8b | 8B | 59% | 30% |

Spearman rank correlation between model size and syntax rate: ρ = 1.00. A perfect monotonic relationship: syntax rate never decreases with model size anywhere in the sample (the two 8B models tie at 59%).

Notice the 3B model: 44% syntax but 0% run. It generates code that parses but has wrong semantics. The 1B model generates nothing useful at all. The threshold for useful output appears to be somewhere between 3B and 7B for Synoema.

Also notice the run% variance at 8B (7%–41% depending on model family). Qwen3-8B achieves 59% syntax but only 7% run — it knows the surface syntax but gets the semantics wrong. More on this in H5.
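The ρ = 1.00 figure can be recomputed directly from the table. A self-contained sketch (ties, like the two 8B models, get average ranks, which is the standard Spearman treatment):

```python
def ranks(xs):
    """Rank values ascending; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                  # extend over the tied run
        avg = (i + j) / 2 + 1       # average rank for the tie
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

sizes  = [1, 3, 7, 8, 8]       # parameters, in billions
syntax = [0, 44, 56, 59, 59]   # syntax% from the Evidence B table
print(spearman(sizes, syntax))  # 1.0
```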

H3: Multipass Self-Correction — MODEL-SIZE-DEPENDENT

Multipass shows the model its own output and asks it to review and correct it. We expected this to help. For small models, it hurt significantly.

| Model | Baseline run% | Multipass run% | Δ |
|---|---|---|---|
| 3B | 60% | 28% | −32pp |
| 7B | 12% | 16% | +4pp |

For the 3B model, multipass makes things dramatically worse. The model, when asked to review its own output, introduces errors into working code — corrupting programs that had been correct on the first pass. Small models don't have enough capacity to self-critique reliably; they generate random edits instead.

For the 7B model, multipass provides a small positive signal (+4pp run rate), though with only one repeat, we can't be confident this isn't noise.

The 8B results (from Evidence B) are more interesting: Qwen3-8B with multipass achieves 48% run rate, up from 7% on baseline. This is a massive improvement: multipass corrects the semantic errors in code that the model already gets syntactically right on the first pass. For Qwen3-8B specifically, the self-correction mechanism works.

Implication: disable multipass for models ≤3B. Consider enabling for ≥7B, but measure per model — the effect is highly architecture-dependent.
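The multipass loop itself is simple enough to sketch. Here `generate` and `run_tests` are hypothetical stand-ins for the model call and the test harness, not the actual benchmark API; the early-exit guard is a design choice suggested by the 3B failure mode (don't let the model "fix" code that already works), and it assumes the harness can run tests between passes.

```python
def multipass(task, generate, run_tests, max_passes=2):
    """Generate, then show the model its own output and ask for a fix.

    `generate(prompt) -> str` and `run_tests(program) -> bool` are
    illustrative placeholders for a model call and a test harness.
    """
    program = generate(f"Write Synoema code for: {task}")
    for _ in range(max_passes - 1):
        if run_tests(program):
            break  # guard: never ask the model to "fix" working code
        program = generate(
            f"Task: {task}\n"
            f"Your previous attempt:\n{program}\n"
            "Review it and output a corrected version."
        )
    return program
```

Without that guard, a small model's second pass can corrupt a correct first pass — which is exactly the 3B result above.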

H4: Feature Difficulty — CONFIRMED

We hypothesized that difficulty would follow: fundamentals < data structures < applications < abstractions. The 50-task corpus test confirmed this, with a nuance.

| Category | Syntax% | Run% |
|---|---|---|
| fundamentals | 53% | 26% |
| data_structures | 26% | 13% |
| applications | 25% | 0% |
| abstractions | 37% | 0% |

The ordering holds: fundamentals is easiest, then data structures, then everything else. But applications and abstractions both achieve 0% run rate — not a smooth gradient but a sharp cliff. The model can't produce correct output for complex workflows or higher-order functions at all, on the 7B baseline.

The syntax/run gap for abstractions (37% syntax, 0% run) is particularly striking. The model generates syntactically valid Synoema code — it knows what pipe operators and lambdas look like — but the semantics are wrong. It's learned the shape without the meaning.

Per-task examples confirmed this pattern. The hardest tasks had specific characteristics: terse instructions without explicit call sites (model must infer the main expression), Synoema-specific operators without analogues in common languages (pipe, ternary chains), and pattern matching with literal patterns rather than guards.

H5: Architecture Matters — CONFIRMED

At similar parameter counts, architecture and training objective (code-specialized vs general) produce significantly different results.

In the 7–24B tier (same inference cost range), we tested four models:

| Model | Size | Type | Baseline run% |
|---|---|---|---|
| gemma-3-12b-it | 12B | General | 66% |
| mistral-small-24b | 24B | General | 50% |
| llama-3.1-8b | 8B | General | 11% |
| qwen2.5-coder-7b | 7B | Code-specialized | 41% |

Gemma-3-12B achieves 66% run rate — the best result in this tier, better than the 24B Mistral model despite having half the parameters. Architecture matters more than size within this range.

The code-specialized Qwen2.5-Coder-7B achieves 41% run rate on baseline, better than the general Llama-3.1-8B (30% in the Evidence B run) despite fewer parameters. Code specialization provides a meaningful advantage on Synoema generation — plausibly because the training distribution included more structured, typed code that shares patterns with Synoema's syntax.

Practical recommendation: in the 7–24B tier, prefer Gemma-3-12B. For cost-constrained deployment at 7B, prefer Qwen2.5-Coder-7B over general-purpose alternatives.

The Per-Task Breakdown

The aggregate numbers obscure important variation. Looking at individual tasks on the 3B model (baseline):

| Task | Syntax% | Run% | Note |
|---|---|---|---|
| factorial | 100% | 100% | Trivial recursion, solved by all models |
| fizzbuzz | 100% | 100% | Conditional + string, well-represented in training |
| fibonacci | 91% | 91% | Near-perfect on small models |
| binary_search | 100% | 75% | Logic errors on edge cases |
| filter_map | 91% | 75% | List comprehension works, complex composition fails |
| quicksort | 16% | 16% | Divide-and-conquer fails on small models |
| pattern_match | 91% | 0% | Syntax ok, semantic failure on ADT matching |
| type_definition | 91% | 0% | Syntax ok, logic wrong |

The pattern_match and type_definition rows are the most interesting. 91% syntax success means the model can write Synoema syntax for these patterns — it knows what ADT definitions and pattern matching look like. But 0% run rate means it consistently gets the semantics wrong. It's producing Synoema-shaped code that doesn't do the right thing.

This is a qualitatively different failure mode from quicksort (16% syntax), where the model fails to produce the language syntax at all. For ADT tasks, the model has learned the form but not the content. Fine-tuning on correct examples for these specific patterns should help — that's what the corpus and fine-tuning experiment are designed to test.
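The two failure modes can be separated mechanically. A sketch, again with illustrative field names rather than the benchmark's actual schema:

```python
from collections import Counter

def failure_mode(attempt):
    """Classify one attempt into the failure modes discussed above."""
    if not attempt["parsed"]:
        return "syntax_fail"    # quicksort-style: can't produce the language at all
    if not attempt["passed_tests"]:
        return "semantic_fail"  # pattern_match-style: right shape, wrong meaning
    return "pass"

def mode_counts(attempts):
    return Counter(failure_mode(a) for a in attempts)
```

For an ADT task like pattern_match, this classifier would put nearly every attempt in the `semantic_fail` bucket — the signature of "learned the form but not the content".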

What This Means for Fine-Tuning

The Phase D results shape our fine-tuning strategy directly:

Target the 0%-run categories. Applications and abstractions reach 0% run rate on the baseline 7B model. The corpus specifically includes these categories, and fine-tuning with correct examples in these categories is the intervention most likely to close the gap.

Use the baseline prompt, not compact. H1 is disproved — compact hurts. Our fine-tuning evaluation uses the full 1800-token baseline reference.

Prefer code-specialized base models. H5 is confirmed — starting from Qwen2.5-Coder for fine-tuning gives a better starting point than a general-purpose model of the same size.

Measure per category, not just aggregate. Aggregate run rate conceals the fact that some categories are solved (factorial: 100%) and others are completely open (abstractions: 0%). Category-level metrics are more actionable.
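Category-level aggregation is a one-pass fold over the results. A minimal sketch, assuming each result record carries a `category` field and a `passed_tests` boolean (illustrative names, not the stored schema in benchmarks/results/):

```python
from collections import defaultdict

def run_rate_by_category(results):
    """Map category -> run% . Record fields are illustrative."""
    buckets = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for r in results:
        b = buckets[r["category"]]
        b[0] += bool(r["passed_tests"])
        b[1] += 1
    return {cat: 100.0 * p / n for cat, (p, n) in buckets.items()}
```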

The fine-tuning benchmarks are pending as of this writing. We'll report the results — including the failures — in a follow-up article.

Related Articles

The Scientific Method Behind Synoema

The full methodology: 12 falsifiable hypotheses, statistical tests, corpus design, and reproducibility.

From Zero to 41%: Building an AI That Writes Working Code

The narrative behind the numbers: corpus generation, fine-tuning, and what 59% failure rate means.

Why AI Writes Broken Code — and How Type Systems Can Fix It

The types of errors LLMs make and the structural interventions that address them.