# From Zero to 41%: Building an AI That Writes Working Code
The 41% number is the one people ask about. It sounds low — and it is — but understanding what it means requires understanding where we started, what we did to get there, and what "working" means in this context.
This is the story of building a domain-specific code generation pipeline from scratch: language design, corpus creation, fine-tuning, and honest evaluation. Including what didn't work.
## Where We Started: The Blank Slate Problem
When you build a new programming language, no existing AI model knows how to write it. There are no Stack Overflow answers, no GitHub repositories, no training examples in any public dataset. Every model that has ever been trained has zero exposure to Synoema.
This is both the challenge and the scientific interest. We can measure exactly how LLMs learn a new language from scratch — using only in-context documentation, no prior exposure — and we can measure how that changes with fine-tuning. We're not asking "can LLMs write Python?" We're asking "can LLMs learn any language if you design it right?"
The baseline, with zero fine-tuning, is what Phase D measured: a 7B model with our full ~1800-token reference in the prompt achieves 41% run rate on the standard task set. A 70B model achieves 61%. A 1B model achieves 0%.
Zero fine-tuning. Just documentation in the context window. The 41% baseline is purely in-context learning — the model absorbing syntax rules and examples from the prompt and applying them to new tasks.
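Mechanically, the baseline is simple: the full language reference is prepended to every task prompt, and the model must learn Synoema from that context alone. A minimal sketch of this kind of prompt assembly (the wording and section markers are illustrative, not the project's actual template):

```python
def build_baseline_prompt(reference: str, task: str) -> str:
    """Assemble an in-context-learning prompt: the whole language
    reference first, then the task. No fine-tuning is involved; the
    model sees Synoema only through this context window."""
    return (
        "You are writing programs in Synoema, a language defined below.\n\n"
        f"=== LANGUAGE REFERENCE ===\n{reference}\n\n"
        f"=== TASK ===\n{task}\n"
        "Respond with a single complete Synoema program."
    )

# The real reference is ~1800 tokens; a placeholder stands in here.
prompt = build_baseline_prompt("(reference text)", "Print the factorial of 10.")
```

Everything the model knows about the language has to fit in (and be absorbed from) that single prompt, which is why the 1B model, with too little in-context learning capacity, lands at 0%.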
## Building the Corpus
Fine-tuning a model on a new language requires training examples — correct programs that demonstrate the language's idioms. For Synoema, we couldn't source these from the web. We had to generate them.
The approach: use a capable general-purpose model (Gemma 3 12B via OpenRouter) to generate candidate Synoema programs, then validate each one by actually running it. Programs that parse, type-check, and produce correct output are kept. Programs that fail are discarded.
The result: from 5,041 generated candidates, 5,037 passed validation — a 99.9% pass rate. This high pass rate reflects the quality of the generating model and the clarity of the task descriptions, not luck. Each task has an explicit expected output that the compiler can verify automatically.
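The validation gate can be sketched as follows. The `synoema` command-line invocation is an assumed interface (the actual compiler harness may differ); the function keeps a candidate only if it compiles, runs, and prints the expected output:

```python
import subprocess

def validate(source: str, expected_output: str, timeout_s: int = 10) -> bool:
    """Return True only if the candidate program compiles, runs within
    the timeout, and prints the expected output. The `synoema run`
    CLI is a hypothetical interface; a missing binary counts as failure."""
    try:
        result = subprocess.run(
            ["synoema", "run", "-"],          # hypothetical compiler CLI
            input=source,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return (result.returncode == 0
            and result.stdout.strip() == expected_output.strip())
```

In the full pipeline this sits inside a loop: generate a candidate with Gemma 3 12B, call the validator, and keep only the survivors. Because each task carries an explicit expected output, the filter needs no human judgment.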
The corpus covers five categories:
| Category | Description | Examples |
|---|---|---|
| fundamentals | Arithmetic, recursion, basic I/O, guards | ~1500 |
| data_structures | Lists, tuples, records, trees, ADTs | ~1200 |
| abstractions | Higher-order functions, lambdas, pipes | ~1000 |
| applications | Sorting, searching, multi-function programs | ~800 |
| string_ops | String manipulation, formatting, parsing | ~537 |
Category distribution was chosen based on Phase D findings: fundamentals and data structures are underperforming but learnable; applications and abstractions are at 0% run rate and most in need of improvement.
An additional 79 dialogue examples were added for the conversational assistant variant: 20 "explain this syntax" pairs, 20 debug-fix pairs, 20 Q&A pairs, and 19 code review pairs. Total corpus for the assistant model: 5,116 examples.
## The Fine-Tuning Setup
We trained on a consumer AMD GPU: an AMD Radeon RX 7900 GRE with 16 GB VRAM, running ROCm 6.4 on Ubuntu 22.04. Not a cloud cluster, not an A100 farm — hardware that a small research team can actually own and operate.
Training method: QLoRA (Quantized Low-Rank Adaptation). LoRA adds a small number of trainable parameters on top of a frozen base model; the "Q" means the base model's weights are quantized to reduce memory usage. The 1.5B and 3B models fit comfortably, so we trained them as plain bf16 LoRA with rank 16 and alpha 32; the 7B model required 4-bit NF4 quantization of the base weights to fit in 16 GB of VRAM.
Hyperparameters:
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| LR scheduler | Cosine decay |
| Learning rate | 2e-4 |
| Epochs | 3 |
| Effective batch size | 16 |
| Warmup steps | 20 |
| Max sequence length | 512 tokens |
| LoRA rank / alpha | 16 / 32 |
| LoRA target modules | q, k, v, o, gate, up, down projections |
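The table's settings map fairly directly onto the Hugging Face `peft` and `transformers` APIs. The sketch below shows one plausible wiring for the 7B QLoRA run; it is not the project's actual training script, and the dropout value, output path, and the 4 × 4 split of the effective batch size are assumptions:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization for the 7B run (1.5B/3B trained in plain bf16).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on all attention and MLP projections, per the table.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,            # assumption: dropout not stated in the table
    task_type="CAUSAL_LM",
)

# Effective batch size 16 = 4 per device x 4 accumulation steps
# (one possible split; other splits give the same effective size).
training_args = TrainingArguments(
    output_dir="synoema-qlora",   # placeholder path
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=20,
    optim="adamw_torch",
    bf16=True,
    logging_steps=10,
)
```

These configs would then be handed to a trainer (e.g. TRL's `SFTTrainer`) along with the 5,037-example corpus and a 512-token max sequence length.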
## Training Results: What the Numbers Look Like
The 1.5B model (Qwen2.5-Coder-1.5B-Instruct) trained in 34 minutes and 18 seconds. The 3B model trained in 45 minutes and 23 seconds. Both on a consumer GPU in a home lab.
| Model | Train loss | Token accuracy | Runtime |
|---|---|---|---|
| 1.5B (Qwen2.5-Coder) | 0.3261 | 91.4% | 34 min 18 s |
| 3B (Qwen2.5-Coder) | 0.3249 | 91.5% | 45 min 23 s |
| 7B (Qwen2.5-Coder) | — | — | training |
91.4% token accuracy means that, on the training set, the model correctly predicts the next token 91.4% of the time. The loss of 0.326 is within the expected range for this task (it corresponds to a per-token perplexity of about e^0.326 ≈ 1.39). Both models show similar convergence curves: rapid improvement in epoch 1, slower gains in epochs 2 and 3.
These are training-time metrics. They tell us the model learned something — it can reproduce training examples with high fidelity. Whether it generalizes to new Synoema tasks requires benchmark evaluation, which is pending as of this writing.
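For concreteness, token accuracy is conventionally computed as the fraction of supervised positions where the argmax prediction matches the label, skipping masked positions; the training framework's exact masking may differ from this minimal sketch:

```python
def token_accuracy(predicted_ids, label_ids, ignore_id=-100):
    """Fraction of positions where the predicted token id matches the
    label, skipping padded/masked positions (label == ignore_id, the
    usual Hugging Face convention)."""
    hits = total = 0
    for pred, label in zip(predicted_ids, label_ids):
        if label == ignore_id:
            continue
        total += 1
        hits += int(pred == label)
    return hits / total if total else 0.0
```

Note that this is teacher-forced accuracy: each prediction is made with the correct preceding tokens in context, which is why 91.4% here says little on its own about whole-program correctness on unseen tasks.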
## The 41% Number in Context
The 41% run rate that gives this article its title comes from the baseline (no fine-tuning) 7B model on the standard 9-task benchmark: qwen2.5-coder-7b achieves 41% run rate with the full baseline prompt. This is the starting point, not the fine-tuned result.
For comparison:
- 70B model (llama-3.3-70b), no fine-tuning: 61% run rate
- 7B model (qwen2.5-coder-7b), no fine-tuning: 41% run rate
- 3B model (qwen2.5-coder-3b), no fine-tuning: 60% run rate on standard tasks, 12% on 50-task corpus
- 1B model (llama-3.2-1b), no fine-tuning: 0% run rate
The 41% figure comes from the standard task set, which includes factorial, fibonacci, and fizzbuzz, tasks that any model handles easily. The 50-task corpus is considerably harder: abstract function composition, complex data structure operations, and terse implicit-main programs.
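Run rate, as used throughout, is the share of tasks whose generated program parses, runs, and produces the expected output; the syntax rates quoted in the failure analysis count programs that at least parse. A toy sketch of the scoring (the per-task result record is illustrative):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    parsed: bool      # syntactically valid Synoema
    ran: bool         # executed without error
    correct: bool     # produced the expected output

def run_rate(results):
    """Fraction of tasks that parsed, ran, and gave correct output."""
    ok = sum(1 for r in results if r.parsed and r.ran and r.correct)
    return ok / len(results)

def syntax_rate(results):
    """Fraction of tasks whose program at least parsed."""
    return sum(1 for r in results if r.parsed) / len(results)
```

The gap between the two metrics is informative on its own: a high syntax rate with a low run rate (as on the ADT tasks below) means the model has learned the surface form of the language but not its semantics.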
## What the 59% Failure Looks Like
Failures are not uniform. They fall into distinct categories:
**Semantic failures on ADT tasks** (pattern_match, type_definition): 91% syntax, 0% run. The model generates syntactically valid Synoema code for algebraic data types but gets the semantics wrong: it knows what ADT definitions look like but not how to use them correctly. This is the main target of the fine-tuning corpus, where we have ~1,200 ADT examples for training.

**Complete syntax failures** (quicksort, complex data-structure operations): 16% syntax. At 3B scale, the model cannot produce correct Synoema syntax for divide-and-conquer algorithms at all. Larger models do better (7B achieves 56% syntax on quicksort), suggesting this is a capacity issue.

**Terse instruction failures** (implicit-main tasks): 0% on both syntax and run for tasks with very short prompts that require inferring the expected main expression. The model needs more explicit instruction. Corpus improvements target these specifically: we add explicit call examples to terse task descriptions.

**String operation failures**: below-average performance on string manipulation. Python's string idioms don't transfer, because Synoema's string builtins (str_find, str_slice, str_split) have different names and semantics. The corpus includes 537 string-operation examples to address this.
## What Fine-Tuning Is Trying to Achieve
The core hypothesis (H6 in our test plan): fine-tuning on 5,037 validated Synoema programs raises the 7B model's run rate from 41% to 75% or higher on the standard task set.
More interesting is H7: can a fine-tuned 1.5B model exceed the baseline 7B model's 41% run rate? If yes, that's a significant result — it means that 5,037 Synoema-specific training examples can substitute for roughly 5x more parameters when it comes to generating correct Synoema code. That has practical implications: smaller, faster, cheaper models for deployment.
We don't know yet. The benchmarks will tell us.
What we do know: the training went smoothly, both models converged to similar loss values (0.3249–0.3261), and the token accuracy is high enough to suggest that the model has genuinely learned to reproduce Synoema syntax with high fidelity. Whether it generalizes to new tasks is the empirical question.
We'll publish the results when they're ready, whether they confirm or refute the hypothesis. That's the commitment.
## Related Articles

- **What We Learned Teaching AI a New Language.** Phase D baseline results in full: 5 hypotheses, 10+ models, surprising findings.
- **The Scientific Method Behind Synoema.** The full hypothesis framework (H1–H12), statistical methodology, and evaluation protocol.
- **Why Build a New Programming Language in the Age of AI?** The motivation and design philosophy behind Synoema.