Intermediate Results: v6 Fine-Tuning

These are intermediate results — not a final announcement. We publish them now because honest intermediate data is more useful than polished end results that arrive six months later. The numbers are real, the regression is real, and the investigation is ongoing.

What We Did in v6

v6 represents the first major format change in our fine-tuning pipeline: we switched from Alpaca format to ChatML format, grew the corpus from 5,037 to 5,907 validated examples, and fine-tuned all three model sizes (1.5B, 3B, 7B) simultaneously.

The corpus expansion focused on semantic depth: error chains, data pipelines, ADT state machines, and contract-based tests. Generation was done via OpenRouter (Gemma 3 12B) with compiler validation — every example must execute with returncode=0 in under 10 seconds.
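The compiler-validation gate described above can be sketched as a small harness. This is a minimal sketch, assuming a hypothetical `synoema run` CLI entry point (the real pipeline's invocation may differ); the returncode and 10-second budget match the criteria stated in the text.

```python
import os
import subprocess
import tempfile

def validates(source: str, cmd=("synoema", "run"), timeout: float = 10.0) -> bool:
    """Return True iff `source` executes cleanly within `timeout` seconds.

    `cmd` defaults to a hypothetical `synoema run` CLI; the real
    pipeline's compiler entry point may be named differently.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".syn", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(
            [*cmd, path],
            capture_output=True,
            timeout=timeout,  # reject examples slower than the 10s budget
        )
        return result.returncode == 0  # keep only clean executions
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```

Passing `cmd` explicitly makes the gate reusable against any interpreter, which is also how a harness like this can be smoke-tested without the target compiler installed.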

v6 Results: 7B Model

The 7B model (Qwen2.5-Coder-7B-Instruct) is the first fully evaluated v6 model. Results from the Phase D benchmark (9 standard tasks, 5 repeats, pass@1 at temperature=0):

Metric            Baseline 7B   Fine-tuned 7B v6   Delta
syntax_pass       56%           100%               +44pp
run_pass          41%           90.5%              +49.5pp
constructs_pass   —             44.6%              −8.1pp vs v5

The run rate improvement is the main result: +49.5 percentage points. The model went from generating syntactically broken or runtime-failing code 59% of the time to generating runnable code 90.5% of the time.
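Since the benchmark is pass@1 at temperature=0, each pass rate is just the fraction of runs (tasks × repeats) where the check succeeded. A minimal aggregation sketch, assuming a hypothetical per-run record shape (not the repo's actual results schema):

```python
from collections import defaultdict

METRICS = ("syntax_pass", "run_pass", "constructs_pass")

def pass_rates(runs):
    """Aggregate per-run eval records into a pass rate per metric.

    `runs` is a list of dicts like
    {"task": "t1", "syntax_pass": True, "run_pass": True, "constructs_pass": False}
    -- a hypothetical record shape for illustration.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in runs:
        for metric in METRICS:
            totals[metric] += 1
            passes[metric] += bool(record[metric])
    # pass@1 at temperature=0: fraction of all runs that passed
    return {m: passes[m] / totals[m] for m in METRICS}
```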

The Regression: Constructs

There is a problem. The constructs pass rate — which measures whether the model uses the specific Synoema language constructs asked for in the task — dropped from 52.7% (v5) to 44.6% (v6).

A program that runs but avoids the constructs it was asked to use is not the goal. If the task says "use |> pipe chains", we want pipe chains. If the task says "use and_then combinator", we want and_then, not manual pattern matching that happens to produce the same result.
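A construct check of this kind can be approximated mechanically. The sketch below is a naive token-presence check against per-task requirements; the task IDs and `REQUIRED` table are hypothetical, and the real harness may well parse the AST rather than scan text.

```python
import re

# Hypothetical per-task construct requirements; the real harness's
# task specifications may differ.
REQUIRED = {
    "pipe_task": ["|>"],
    "combinator_task": ["and_then"],
}

def constructs_pass(task_id: str, source: str) -> bool:
    """Naive check: every construct the task requires must appear in
    the generated source. re.escape is needed because constructs like
    "|>" contain regex metacharacters."""
    return all(
        re.search(re.escape(construct), source)
        for construct in REQUIRED[task_id]
    )
```

The weakness of a textual check (a construct mentioned in a comment would count) is one reason an AST-based checker is preferable when the compiler exposes one.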

The regression is real. We're not minimizing it.

Failure Analysis

From 40 semantic evaluation tasks, the most common construct failures were:

Why ChatML May Be the Cause

The format change from Alpaca to ChatML is the most significant variable between v5 and v6. Our current hypothesis (H14) is that ChatML improved the model's ability to generate runnable code — it "speaks" Synoema more fluently — but changed how it interprets construct-specific task instructions.
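For concreteness, the two formats wrap the same instruction very differently. These are the general shapes of the standard Alpaca and ChatML templates; the exact templates used in the pipeline (including the system message, which is invented here) may vary.

```python
def alpaca_prompt(instruction: str) -> str:
    # Classic Alpaca single-turn template: headers, no role markers.
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

def chatml_prompt(instruction: str) -> str:
    # ChatML wraps each turn in <|im_start|>/<|im_end|> role markers.
    return (
        "<|im_start|>system\nYou are a Synoema programmer.<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

Under H14, the model may attend differently to task constraints inside a ChatML user turn than inside an Alpaca `### Instruction:` block, which is why the format A/B in the v7 plan matters.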

Three sub-hypotheses under investigation:

Status of 3B and 1.5B

At the time of writing:

We will update this page when those results are available. Based on the semantic eval subset (40 tasks), the 3B v6 model shows run_pass 97.5% and constructs_pass 90.0% — both stronger than the 7B on this specific benchmark. We are running the full Phase D benchmark to confirm.

Is This a "Production" Model?

No. By our internal production criteria (RULES.md §7):

The constructs regression prevents production classification. The 7B v6 model is a research artifact with a known weakness, not a production release.

What Happens Next

v7 corpus work is planned to address the regression:

  1. Add 100+ test declaration examples (currently 37)
  2. Add 50+ pipe chain examples with complex data (pairs, records, nested transforms)
  3. Add constructor naming discipline examples — following task-specified names
  4. Add bind_maybe combinator preference examples
  5. Consider controlled A/B: Alpaca v7 vs ChatML v7 to isolate format effect

We will also run baseline models in ChatML format before making v5 vs v6 comparisons — the format switch makes direct metric comparisons potentially invalid (apples vs oranges).

What the Numbers Mean

A run_pass rate of 90.5% means that if you ask the v6 7B model to write a Synoema program, it will produce working, runnable code 9 times out of 10. That is a massive improvement over the 41% baseline.

A constructs_pass rate of 44.6% means the model used the specific language constructs the task asked for in fewer than half of all attempts. For strict instruction-following (e.g., "write this using |> and and_then"), that is a problem.

For the use case of "generate working Synoema code given a description" — the model is very good now. For the use case of "generate idiomatic, construct-specific Synoema code" — the model needs more work.

Raw Data

All eval results are versioned in the repository under research/finetune/eval/results/. The v6 semantic evaluation (2026-04-13) is at:

All numbers on this page come from those files.