Functional RTL generation: A study in curriculum learning and reward shaping with limited data
Large language models (LLMs) have become capable software engineers, yet chip design remains a frontier where they struggle. Writing correct register-transfer-level (RTL) hardware is unforgiving: a design is judged not by how plausible it looks but by whether it compiles, simulates, and passes a testbench, and high-quality training data is comparatively scarce. The difficulty is real even for the strongest models: at the time of our experiments, the best frontier model we tested, Claude Opus 4.6, passed under 60% of the public CVDP benchmark [1] given a single attempt per problem (pass@1). Those same properties, however, make hardware an unusually clean setting for reinforcement learning, because correctness is checkable by construction rather than by a learned judge.
In this research effort, we target single-turn Verilog generation: given an RTL design specification, the model produces a complete design in a single response. The goal was to understand which levers matter under tight data, so almost everything here is reward and curriculum work on a finite problem set rather than a maximal-effort capability push. Starting from an open base model (Kimi-K2.5), we post-train it with reinforcement learning from verifiable rewards (RLVR) rather than supervised imitation, so that the model learns directly from the syntax and functional correctness of its designs. We call the resulting model Architect v0.1, and in the rest of this post we describe the reward and curriculum choices that enabled it to match a frontier model on the CVDP benchmark.
We optimize the policy model with group-relative policy optimization (GRPO) [2] and importance sampling. Because GRPO computes each rollout’s advantage from the spread of rewards within its group, a group yields a learning signal only when its rollouts disagree: if every rollout passes, or every rollout fails, the within-group advantages vanish and the group contributes no gradient. We track two signals throughout training: the functional pass rate, a direct measure of how often generated designs are correct, and policy entropy, which we find indicates generalization beyond the training mix more reliably than the training reward does.
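To make the vanishing-advantage point concrete, here is a minimal sketch of the group-relative normalization step, assuming the common mean-and-standard-deviation form of GRPO. The function name and the `eps` stabilizer are illustrative choices, not details taken from our training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group's mean and spread.

    `eps` is an assumed stabilizer that avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group whose rollouts disagree produces non-zero advantages...
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# ...while a uniformly passing or uniformly failing group produces none.
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In this sketch `all_pass` and `all_fail` are lists of zeros, so those groups would add nothing to the policy gradient, which is why the mix of problem difficulties matters so much later in the post.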
Verifiable rewards
Chip design provides a rich set of verifiable rewards, and we have iterated extensively on how to shape them. Because these rewards are produced by executing and validating the design rather than by a learned judge, they are far more robust to reward hacking than preference-based rewards. Specifically for non-agentic Verilog generation, we use a functional reward computed from functional verification of the generated code:
The maximum reward is reached only when the design passes overall and all test cases pass; a failing design forfeits the final-pass term and is capped well below the maximum. This gives partial credit while preserving a large bonus for full correctness, and because the signal comes from executing the testbench rather than from a learned judge, it is much harder for the model to game.
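The shape described above can be sketched as a partial-credit term plus a final-pass bonus. The weights below (0.3 and 0.7) and the function signature are hypothetical, chosen only to illustrate the cap on failing designs; they are not the values used in training.

```python
def functional_reward(tests_passed, tests_total, design_passed,
                      partial_weight=0.3, final_weight=0.7):
    """Partial credit for passing test cases plus a bonus for full correctness.

    Both weights are illustrative assumptions. With these values a failing
    design can earn at most `partial_weight`, well below the maximum of 1.0.
    """
    if tests_total <= 0:
        return 0.0
    fraction = tests_passed / tests_total
    fully_correct = design_passed and tests_passed == tests_total
    final_bonus = final_weight if fully_correct else 0.0
    return partial_weight * fraction + final_bonus


full = functional_reward(10, 10, design_passed=True)       # maximum reward
near_miss = functional_reward(9, 10, design_passed=False)  # capped partial credit
broken = functional_reward(0, 10, design_passed=False)     # no credit
```

Keeping the final-pass bonus large relative to the partial term means a design that passes nine of ten tests is still rewarded far less than one that passes all of them.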
To provide denser gradient on Verilog code structure and syntax, we also experimented with a combined functional-and-structural reward in a multi-environment RL setup:
The structural term is gated by a syntax-validity check and mixes multiple normalized code-quality signals:
where each \(s_i \in [0, 1]\) captures one aspect of code quality, such as completeness, structural quality, code length, and n-gram overlap. Even late in training, a large fraction of batches still contained trajectories with near-zero functional reward; the structural term kept a useful gradient flowing by grading those completions on structural quality, separating malformed outputs from syntactically valid ones.
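One way to picture the gated structural term is as a syntax-gated weighted mean of the normalized signals, blended with the functional reward. This is a sketch under stated assumptions: the signal names, the weights, and the 0.2 blending factor are all hypothetical, and the blend itself is one possible way to combine the two terms rather than the one used in our runs.

```python
# Hypothetical weights for the normalized code-quality signals s_i in [0, 1].
SIGNAL_WEIGHTS = {
    "completeness": 0.4,
    "structure": 0.3,
    "length": 0.2,
    "ngram_overlap": 0.1,
}


def structural_reward(syntax_valid, signals, weights=SIGNAL_WEIGHTS):
    """Syntax-gated weighted mean of normalized code-quality signals."""
    if not syntax_valid:
        return 0.0  # gate: malformed Verilog earns no structural credit
    total_weight = sum(weights[name] for name in signals)
    if total_weight <= 0:
        return 0.0
    # Clamp each signal to [0, 1] before weighting it.
    score = sum(weights[name] * min(max(value, 0.0), 1.0)
                for name, value in signals.items())
    return score / total_weight


def combined_reward(functional, structural, structural_weight=0.2):
    """Assumed convex blend of the functional and structural terms."""
    return (1.0 - structural_weight) * functional + structural_weight * structural


signals = {"completeness": 0.9, "structure": 0.8, "length": 1.0, "ngram_overlap": 0.5}
# Two completions that both fail the testbench (functional reward 0.0):
valid_but_wrong = combined_reward(0.0, structural_reward(True, signals))
malformed = combined_reward(0.0, structural_reward(False, signals))
```

Here `valid_but_wrong` is positive while `malformed` is zero, so a group made up entirely of functionally failing rollouts can still produce a non-trivial advantage.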
Curriculum learning
A recurring challenge in this setting is sparse, delayed, and stochastic feedback, compounded by overfitting on finite problem sets. We observed that increasing the number of training epochs did not break the model’s performance ceiling; instead, prolonged training led the runs to overfit, with policy entropy collapsing toward zero as the learning signal concentrated on fewer and fewer problems. We also found that policy entropy tracks evaluation pass rate even more closely than training reward does, which makes preserving exploration a central RL design goal.
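For readers unfamiliar with the metric, policy entropy here can be read as the average Shannon entropy of the model's next-token distributions. A minimal sketch, assuming the per-step probabilities are already available as plain lists (the function names are illustrative):

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_distributions):
    """Average entropy over the sampled steps of one or more rollouts."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)


exploring = mean_policy_entropy([[0.25, 0.25, 0.25, 0.25]] * 3)  # spread-out policy
collapsed = mean_policy_entropy([[1.0, 0.0, 0.0, 0.0]] * 3)      # deterministic policy
```

A value near zero, as in `collapsed`, means the policy has stopped exploring alternatives, which is the regime the overfit runs drifted into.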
In light of these observations, we built a curriculum over the training problems. We bucket each problem into an empirical pass-rate tier, ranging from easy to hard, and schedule a differently weighted tier-mix at each curriculum stage to keep the gradient concentrated on problems the policy can still learn from. Combining a soft focus on the learnable tiers, an initial warmup on the easy tiers, an adaptive learning rate (LR), and a refreshed optimizer after each LR transition, this curriculum broke the prior ceiling and delivered our best confirmed checkpoint. Throughout RL, we also actively monitored the KL divergence from the reference policy as a leading indicator of the policy drift that precedes overfitting and training collapse.
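The tiering and stage-wise mixing can be sketched as follows. The tier thresholds, the stage mixes, and the problem names are hypothetical placeholders rather than the values behind our checkpoint; the point is only that each stage samples tiers with soft weights instead of hard filtering.

```python
import random


def tier_of(pass_rate):
    """Bucket a problem by its empirical pass rate (thresholds are assumed)."""
    if pass_rate >= 0.8:
        return "easy"
    if pass_rate >= 0.2:
        return "medium"  # the tier the policy can still learn from
    return "hard"


# Hypothetical tier mixes: warm up on easy problems, then softly focus
# on the learnable tier while keeping some mass on the others.
STAGE_MIXES = {
    "warmup": {"easy": 0.6, "medium": 0.3, "hard": 0.1},
    "focus": {"easy": 0.1, "medium": 0.7, "hard": 0.2},
}


def sample_batch(pass_rates, stage, batch_size, rng):
    """Sample problem ids by first drawing a tier, then a problem in it."""
    by_tier = {}
    for problem_id, rate in pass_rates.items():
        by_tier.setdefault(tier_of(rate), []).append(problem_id)
    mix = STAGE_MIXES[stage]
    tiers = [t for t in mix if by_tier.get(t)]  # skip tiers with no problems
    weights = [mix[t] for t in tiers]
    batch = []
    for _ in range(batch_size):
        tier = rng.choices(tiers, weights=weights, k=1)[0]
        batch.append(rng.choice(by_tier[tier]))
    return batch


pass_rates = {"fifo": 0.95, "alu": 0.85, "arbiter": 0.5, "uart": 0.3, "cache": 0.05}
batch = sample_batch(pass_rates, "focus", batch_size=8, rng=random.Random(0))
```

In practice the pass rates would be re-estimated as the policy improves, so problems migrate between tiers and the effective mix shifts from stage to stage.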
Results on non-agentic CVDP
We evaluate the post-trained model on the non-agentic CVDP benchmark, reporting each model’s pass@1 rate as the average across multiple evaluation runs. Our best checkpoint matches Claude Opus 4.6 and outperforms GPT-5.4, starting from a Kimi-K2.5 base. All numbers are measured in our own sampling and evaluation environment under matched conditions, including a fixed token budget that is sufficient for a complete Verilog design but not unlimited. We do not use any agent harness or test-time scaling for this evaluation, so this result isolates the intrinsic RTL design ability we are building into the model itself.
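The reported metric can be sketched in a few lines: each run gives one attempt per problem, and the headline number is the mean of the per-run pass rates. The data below is made up purely to show the arithmetic.

```python
from statistics import mean


def pass_at_1(results):
    """Fraction of problems solved with a single attempt each."""
    return sum(results) / len(results)


def mean_pass_at_1(runs):
    """Average the per-run pass@1 across independent evaluation runs."""
    return mean(pass_at_1(run) for run in runs)


# Illustrative outcomes only: two runs over the same four problems.
runs = [
    [True, False, True, True],
    [True, False, False, True],
]
score = mean_pass_at_1(runs)
```

Averaging over runs smooths out sampling noise, which matters when the differences between models are only a few percentage points.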
The gains came from the reward and curriculum work rather than from simply adding data. As the figure shows, the curriculum lifted the model from a pre-curriculum baseline around 52.5% to 54.5%, breaking the earlier ceiling primarily by improving a difficulty tier that earlier stages never moved.
Next steps
Our early results point to several directions we are actively pursuing, some of which are listed below:
Scaling and diversifying training data. Our experiments point to the finite, insufficiently diverse problem set as the main bottleneck: modestly enlarging it did not break the ceiling, because the added problems were not diverse enough. A substantially larger and more diverse training mix is the most direct way to sustain exploration and push past the current ceiling.
Specializing and merging experts. Rather than train a single model to do everything, we are exploring post-training narrow expert checkpoints for individual sub-skills and then merging their capabilities back into one model through techniques like on-policy distillation.
Agentic reinforcement learning. Single-response generation is only the beginning. We are extending RLVR into agentic settings beyond CVDP, where the model works over many turns, calling simulators, waveform viewers, and formal tools, reading their feedback, and revising its design until it converges, rewarded for the final verified outcome rather than a one-shot guess. Scaling this asynchronous, multi-environment agentic training, and teaching the model to drive the same EDA toolchains a hardware engineer uses, is a primary lever for improving long-horizon reasoning under sparse, delayed feedback.
1. Pinckney et al. (2025), “Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification.” arXiv:2506.14074.
2. Shao et al. (2024), “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv:2402.03300.