# From Zero to 41%: Building an AI That Writes Working Code
The 41% number is the one people ask about. It sounds low — and it is — but understanding what it means requires understanding where we started, what we did to get there, and what "working" means in this context.
This is the story of building a domain-specific code generation pipeline from scratch: language design, corpus creation, fine-tuning, and honest evaluation. Including what didn't work.
## Where We Started: The Blank Slate Problem
When you build a new programming language, no existing AI model knows how to write it. There are no Stack Overflow answers, no GitHub repositories, no training examples in any public dataset. Every model that has ever been trained has zero exposure to Synoema.
This is both the challenge and the scientific interest. We can measure exactly how LLMs learn a new language from scratch — using only in-context documentation, no prior exposure — and we can measure how that changes with fine-tuning. We're not asking "can LLMs write Python?" We're asking "can LLMs learn any language if you design it right?"
The baseline, with zero fine-tuning, is what Phase D measured: a 7B model with our full ~1800-token reference in the prompt achieves 41% run rate on the standard task set. A 70B model achieves 61%. A 1B model achieves 0%.
Zero fine-tuning. Just documentation in the context window. The 41% baseline is purely in-context learning — the model absorbing syntax rules and examples from the prompt and applying them to new tasks.
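Mechanically, the baseline is simple: the full language reference is prepended to every task prompt, and the model must learn Synoema from that context alone. A minimal sketch of this kind of prompt assembly (the wording and section markers are illustrative, not the project's actual template):

```python
def build_baseline_prompt(reference: str, task: str) -> str:
    """Assemble an in-context-learning prompt: the whole language
    reference first, then the task. No fine-tuning is involved; the
    model sees Synoema only through this context window."""
    return (
        "You are writing programs in Synoema, a language defined below.\n\n"
        f"=== LANGUAGE REFERENCE ===\n{reference}\n\n"
        f"=== TASK ===\n{task}\n"
        "Respond with a single complete Synoema program."
    )

# The real reference is ~1800 tokens; a placeholder stands in here.
prompt = build_baseline_prompt("(reference text)", "Print the factorial of 10.")
```

Everything the model knows about the language has to fit in (and be absorbed from) that single prompt, which is why the 1B model, with too little in-context learning capacity, lands at 0%.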
## Building the Corpus
Fine-tuning a model on a new language requires training examples — correct programs that demonstrate the language's idioms. For Synoema, we couldn't source these from the web. We had to generate them.
The approach: use a capable general-purpose model (Gemma 3 12B via OpenRouter) to generate candidate Synoema programs, then validate each one by actually running it. Programs that parse, type-check, and produce correct output are kept. Programs that fail are discarded.
The result: from 5,041 generated candidates, 5,037 passed validation — a 99.9% pass rate. This high pass rate reflects the quality of the generating model and the clarity of the task descriptions, not luck. Each task has an explicit expected output that the compiler can verify automatically.
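The validation gate can be sketched as follows. The `synoema` command-line invocation is an assumed interface (the actual compiler harness may differ); the function keeps a candidate only if it compiles, runs, and prints the expected output:

```python
import subprocess

def validate(source: str, expected_output: str, timeout_s: int = 10) -> bool:
    """Return True only if the candidate program compiles, runs within
    the timeout, and prints the expected output. The `synoema run`
    CLI is a hypothetical interface; a missing binary counts as failure."""
    try:
        result = subprocess.run(
            ["synoema", "run", "-"],          # hypothetical compiler CLI
            input=source,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return (result.returncode == 0
            and result.stdout.strip() == expected_output.strip())
```

In the full pipeline this sits inside a loop: generate a candidate with Gemma 3 12B, call the validator, and keep only the survivors. Because each task carries an explicit expected output, the filter needs no human judgment.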
The corpus covers five categories:
| Category | Description | Examples |
|---|---|---|
| fundamentals | Arithmetic, recursion, basic I/O, guards | ~1500 |
| data_structures | Lists, tuples, records, trees, ADTs | ~1200 |
| abstractions | Higher-order functions, lambdas, pipes | ~1000 |
| applications | Sorting, searching, multi-function programs | ~800 |
| string_ops | String manipulation, formatting, parsing | ~537 |
Category distribution was chosen based on Phase D findings: fundamentals and data structures are underperforming but learnable; applications and abstractions are at 0% run rate and most in need of improvement.
An additional 79 dialogue examples were added for the conversational assistant variant: 20 "explain this syntax" pairs, 20 debug-fix pairs, 20 Q&A pairs, and 19 code review pairs. Total corpus for the assistant model: 5,116 examples.
## The Fine-Tuning Setup
We trained on a consumer AMD GPU: an AMD Radeon RX 7900 GRE with 16 GB VRAM, running ROCm 6.4 on Ubuntu 22.04. Not a cloud cluster, not an A100 farm — hardware that a small research team can actually own and operate.
Training method: QLoRA (Quantized Low-Rank Adaptation). LoRA adds a small number of trainable parameters on top of a frozen base model; the "Q" means the base model's weights are quantized to reduce memory usage. The 1.5B and 3B models fit comfortably, so we trained them as plain bf16 LoRA with rank 16 and alpha 32; the 7B model required 4-bit NF4 quantization of the base weights to fit in 16 GB of VRAM.
Hyperparameters:
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| LR scheduler | Cosine decay |
| Learning rate | 2e-4 |
| Epochs | 3 |
| Effective batch size | 16 |
| Warmup steps | 20 |
| Max sequence length | 512 tokens |
| LoRA rank / alpha | 16 / 32 |
| LoRA target modules | q, k, v, o, gate, up, down projections |
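The table's settings map fairly directly onto the Hugging Face `peft` and `transformers` APIs. The sketch below shows one plausible wiring for the 7B QLoRA run; it is not the project's actual training script, and the dropout value, output path, and the 4 × 4 split of the effective batch size are assumptions:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization for the 7B run (1.5B/3B trained in plain bf16).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on all attention and MLP projections, per the table.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,            # assumption: dropout not stated in the table
    task_type="CAUSAL_LM",
)

# Effective batch size 16 = 4 per device x 4 accumulation steps
# (one possible split; other splits give the same effective size).
training_args = TrainingArguments(
    output_dir="synoema-qlora",   # placeholder path
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=20,
    optim="adamw_torch",
    bf16=True,
    logging_steps=10,
)
```

These configs would then be handed to a trainer (e.g. TRL's `SFTTrainer`) along with the 5,037-example corpus and a 512-token max sequence length.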
## Training Results: What the Numbers Look Like
The 1.5B model (Qwen2.5-Coder-1.5B-Instruct) trained in 34 minutes and 18 seconds. The 3B model trained in 45 minutes and 23 seconds. Both on a consumer GPU in a home lab.
| Model | Train loss | Token accuracy | Runtime |
|---|---|---|---|
| 1.5B (Qwen2.5-Coder) | 0.3261 | 91.4% | 34 min 18 s |
| 3B (Qwen2.5-Coder) | 0.3249 | 91.5% | 45 min 23 s |
| 7B (Qwen2.5-Coder) | — | — | training |
91.4% token accuracy means that, on the training set, the model correctly predicts the next token 91.4% of the time. The loss of 0.326 is within the expected range for this task (it corresponds to a per-token perplexity of about e^0.326 ≈ 1.39). Both models show similar convergence curves: rapid improvement in epoch 1, slower gains in epochs 2 and 3.
These are training-time metrics. They tell us the model learned something — it can reproduce training examples with high fidelity. Whether it generalizes to new Synoema tasks requires benchmark evaluation, which is pending as of this writing.
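For concreteness, token accuracy is conventionally computed as the fraction of supervised positions where the argmax prediction matches the label, skipping masked positions; the training framework's exact masking may differ from this minimal sketch:

```python
def token_accuracy(predicted_ids, label_ids, ignore_id=-100):
    """Fraction of positions where the predicted token id matches the
    label, skipping padded/masked positions (label == ignore_id, the
    usual Hugging Face convention)."""
    hits = total = 0
    for pred, label in zip(predicted_ids, label_ids):
        if label == ignore_id:
            continue
        total += 1
        hits += int(pred == label)
    return hits / total if total else 0.0
```

Note that this is teacher-forced accuracy: each prediction is made with the correct preceding tokens in context, which is why 91.4% here says little on its own about whole-program correctness on unseen tasks.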
## The 41% Number in Context
The 41% run rate that gives this article its title comes from the baseline (no fine-tuning) 7B model on the standard 9-task benchmark: qwen2.5-coder-7b achieves 41% run rate with the full baseline prompt. This is the starting point, not the fine-tuned result.
For comparison:
- 70B model (llama-3.3-70b), no fine-tuning: 61% run rate
- 7B model (qwen2.5-coder-7b), no fine-tuning: 41% run rate
- 3B model (qwen2.5-coder-3b), no fine-tuning: 60% run rate on standard tasks, 12% on 50-task corpus
- 1B model (llama-3.2-1b), no fine-tuning: 0% run rate
The 41% figure comes from the standard task set, which includes factorial, fibonacci, and fizzbuzz, tasks that any model handles easily. The 50-task corpus is considerably harder: abstract function composition, complex data structure operations, and terse implicit-main programs.
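Run rate, as used throughout, is the share of tasks whose generated program parses, runs, and produces the expected output; the syntax rates quoted in the failure analysis count programs that at least parse. A toy sketch of the scoring (the per-task result record is illustrative):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    parsed: bool      # syntactically valid Synoema
    ran: bool         # executed without error
    correct: bool     # produced the expected output

def run_rate(results):
    """Fraction of tasks that parsed, ran, and gave correct output."""
    ok = sum(1 for r in results if r.parsed and r.ran and r.correct)
    return ok / len(results)

def syntax_rate(results):
    """Fraction of tasks whose program at least parsed."""
    return sum(1 for r in results if r.parsed) / len(results)
```

The gap between the two metrics is informative on its own: a high syntax rate with a low run rate (as on the ADT tasks below) means the model has learned the surface form of the language but not its semantics.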
## What the 59% Failure Looks Like
Failures are not uniform. They fall into distinct categories:
**Semantic failures on ADT tasks** (pattern_match, type_definition): 91% syntax, 0% run. The model generates syntactically valid Synoema code for algebraic data types but gets the semantics wrong: it knows what ADT definitions look like but not how to use them correctly. This is the main target of the fine-tuning corpus, where we have ~1,200 ADT examples for training.

**Complete syntax failures** (quicksort, complex data-structure operations): 16% syntax. At 3B scale, the model cannot produce correct Synoema syntax for divide-and-conquer algorithms at all. Larger models do better (7B achieves 56% syntax on quicksort), suggesting this is a capacity issue.

**Terse instruction failures** (implicit-main tasks): 0% on both syntax and run for tasks with very short prompts that require inferring the expected main expression. The model needs more explicit instruction. Corpus improvements target these specifically: we add explicit call examples to terse task descriptions.

**String operation failures**: below-average performance on string manipulation. Python's string idioms don't transfer, because Synoema's string builtins (str_find, str_slice, str_split) have different names and semantics. The corpus includes 537 string-operation examples to address this.
## What Fine-Tuning Is Trying to Achieve
The core hypothesis (H6 in our test plan): fine-tuning on 5,037 validated Synoema programs raises the 7B model's run rate from 41% to 75% or higher on the standard task set.
More interesting is H7: can a fine-tuned 1.5B model exceed the baseline 7B model's 41% run rate? If yes, that's a significant result — it means that 5,037 Synoema-specific training examples can substitute for roughly 5x more parameters when it comes to generating correct Synoema code. That has practical implications: smaller, faster, cheaper models for deployment.
We don't know yet. The benchmarks will tell us.
What we do know: the training went smoothly, both models converged to similar loss values (0.3249–0.3261), and the token accuracy is high enough to suggest that the model has genuinely learned to reproduce Synoema syntax with high fidelity. Whether it generalizes to new tasks is the empirical question.
We'll publish the results when they're ready, whether they confirm or refute the hypothesis. That's the commitment.
## Related Articles

- **What We Learned Teaching AI a New Language.** Phase D baseline results in full: 5 hypotheses, 10+ models, surprising findings.
- **The Scientific Method Behind Synoema.** The full hypothesis framework (H1–H12), statistical methodology, and evaluation protocol.
- **Why Build a New Programming Language in the Age of AI?** The motivation and design philosophy behind Synoema.