OpenEnv + CadQuery + GRPO
Frontier models can describe CAD, but tiny models often fail at executable, editable CadQuery. CADForge turns that gap into an RL environment: write code, build real geometry, receive reward, repair, and improve.
Choose a task. CADForge will score a weak seed, repair it, verify it, and render the improved STL.
CADForge is not a static benchmark. The first dense GRPO run exposed a reward flaw: the model could receive partial reward while still failing to build. The environment adapted. Buildability became the first gate, failed code became negative reward, and each failure type became a curriculum target.
SyntaxError, missing fixture, invented API, disconnected parts, clipped final union, weak semantic match.
Failed trajectories become new repair tasks: fix one concrete CAD failure and improve the reward delta.
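A minimal sketch of how a failed trajectory could be turned into a repair task. The field names, prompt wording, and schema below are illustrative assumptions, not CADForge's actual task format.

```python
# Illustrative only: field names and prompt wording are assumptions,
# not CADForge's actual repair-task schema.
def make_repair_task(trajectory: dict) -> dict:
    failure = trajectory["notes"][0]  # e.g. "SyntaxError" or "clipped final union"
    return {
        "task_id": f"{trajectory['task_id']}-repair",
        "prompt": (
            f"This CadQuery script fails with: {failure}. "
            "Fix only that failure and keep the rest of the design.\n\n"
            + trajectory["code"]
        ),
        "baseline_total": trajectory["reward"]["total"],  # the reward delta the repair must beat
        "failure_type": failure,
    }
```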
GRPO groups compare buildable vs broken candidates, giving the model clean advantage signals.
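A minimal sketch of the group-relative advantage computation behind that signal, using made-up reward values; the normalization follows the standard GRPO formulation rather than CADForge's exact implementation.

```python
import numpy as np

def group_advantages(rewards: list[float]) -> np.ndarray:
    """Standard GRPO-style baseline: center on the group mean, scale by the group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Illustrative group: two failed builds (negative reward) and two buildable candidates.
print(group_advantages([-1.0, -1.0, 0.55, 0.86]))
# Broken candidates receive clearly negative advantages; buildable ones receive positive advantages.
```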
We ran seven distinct training experiments on RunPod H200. The important story is not just that loss went down; it is that the environment exposed reward hacking, and that build-gated GRPO plus adaptive repair then separated buildable CAD from broken code.
| Run | What happened | Lesson |
|---|---|---|
| 1. Qwen3.5-2B SFT | train loss 1.4480 to 0.1658; eval loss 0.4477 to 0.2676 | 2B learned CadQuery grammar and trace format. |
| 2. Qwen3.5-2B dense GRPO | 160 completions; 0.0% build rate; mean/best reward 0.3387 / 0.5303 | Reward was learnable, but too hackable without a hard build gate. |
| 3. Qwen3.5-9B SFT | train loss 2.6020 to 0.1413; eval loss 0.3650 to 0.2398 | 9B learned syntax and structure faster than 2B. |
| 4. Qwen3.5-9B dense GRPO | 160 completions; 0.0% build rate; mean/best reward 0.4355 / 0.6828 | Bigger model got higher scalar reward while still failing buildability. |
| 5. Qwen3.5-9B strict GRPO | 320 completions; 96 buildable; best CADForge score 0.9352 | Buildability-first reward produced the first real breakthrough. |
| 6. Adaptive repair v1 | 120 repair completions; 0 buildable; clipped-output pattern exposed | The environment found a curriculum/completion-length bug. |
| 7. Adaptive repair final 8192 | 180 repair completions; 53 buildable; 0 clipped completions; best reward 0.882 | Failure mining plus longer completions recovered buildable repairs. |
Use the final Qwen3.5-9B adaptive-repair LoRA to test CadQuery generation and repair locally or on a GPU notebook.
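A minimal sketch for trying the adapter with transformers and peft; the repository IDs below are placeholders, not published checkpoint names.

```python
# Placeholders: swap in the actual base model and adapter repo IDs.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "base-model-id"                          # placeholder for the 9B base model
ADAPTER = "your-org/cadforge-adaptive-repair-lora"    # placeholder adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)     # attach the LoRA adapter

prompt = "Repair this CadQuery script so it builds:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```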
On held-out eval after strict GRPO, 2 of 3 generated CadQuery files built successfully. The remaining chair case failed because the completion clipped before the final assembly, which directly motivated the adaptive repair run.
CADForge started with dense rewards for code shape, semantic words, topology, contact, reference similarity, and editability. Training showed a classic reward-hacking pattern: models could earn positive-looking reward while still producing non-buildable CAD. The fix was to make buildability the first gate.
Dense reward gave partial credit for code-like text, named parts, and semantic hints even when CadQuery failed to export an STL.
Strict GRPO makes failed builds negative. Dense topology, semantic, contact, reference, and editability scores unlock only after the CAD builds.
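A sketch of that gate, assuming a boolean build result and precomputed dense sub-scores; the penalty value and weights are illustrative, not the environment's exact numbers.

```python
BUILD_FAIL_REWARD = -1.0  # illustrative penalty for non-buildable code

def gated_reward(build_ok: bool, dense: dict) -> float:
    if not build_ok:
        # No partial credit: dense terms never apply to code that cannot build.
        return BUILD_FAIL_REWARD
    # Dense dimensions unlock only after a successful CadQuery build and STL export.
    weights = {
        "topology": 0.25, "contact": 0.20, "semantic_parts": 0.20,
        "reference_similarity": 0.20, "editability": 0.15,
    }  # illustrative weights
    return sum(w * dense.get(k, 0.0) for k, w in weights.items())
```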
Each action returns reward JSON: build, topology, contact, semantic parts, reference similarity, editability, efficiency, and verifier notes.
```json
{
  "build": 1.0,
  "topology": 0.82,
  "contact": 0.74,
  "semantic_parts": 0.61,
  "reference_similarity": 0.58,
  "editability": 0.80,
  "total": 0.86,
  "notes": ["candidate builds", "recognizable task parts", "clean fixture"]
}
```
The Space is both a demo and an OpenEnv-style reward service. A model can submit CadQuery code, receive structured observations, and use those step rewards for SFT data generation, GRPO rollouts, or human-readable debugging.
- `GET /healthz`: Health check for the CADForge Space.
- `POST /api/space/repair-loop`: Runs the demo loop: weak seed, repaired CAD, CadQuery build, reward JSON, and STL artifact URLs.
- `POST /api/space/demo`: Scores a known buildable candidate and returns reward dimensions plus artifact paths.
- `GET /api/space/loop-stl/{task_id}`: Downloads the repaired STL from the most recent repair-loop run.
- `GET /api/space/loop-stl/{task_id}/{step_id}`: Downloads a specific weak-seed or repaired-step STL for visual comparison.
- OpenEnv step route: The OpenEnv server wraps complete CadQuery Python files as actions and returns observations with reward JSON and verifier notes.

CAD is built through repeated code edits, reward observations, and repairs rather than one-shot text generation.
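A hedged client sketch against the routes above; the Space URL, request payload, and response field names are assumptions, not a documented contract.

```python
import requests

SPACE_URL = "https://<your-space>.hf.space"  # placeholder Space URL

# Assumed payload shape: the demo loop is keyed by a task id.
resp = requests.post(f"{SPACE_URL}/api/space/repair-loop",
                     json={"task_id": "bracket"},
                     timeout=300)
result = resp.json()

print(result.get("reward", {}))           # reward JSON as shown above (assumed field name)
for url in result.get("stl_urls", []):    # assumed field name for STL artifact URLs
    print("STL:", url)
```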
The agent interacts with real CadQuery execution, STL export, mesh checks, reference metrics, and persistent state.
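For a sense of what that execution step involves, here is a sketch that runs a complete CadQuery script, exports an STL, and applies basic mesh checks; the `result` output variable and the use of trimesh are assumptions about the environment's conventions.

```python
import cadquery as cq
from cadquery import exporters
import trimesh

# A complete candidate script, as submitted by the model.
code = 'import cadquery as cq\nresult = cq.Workplane("XY").box(20, 10, 5)'

namespace = {}
exec(code, namespace)              # execute the candidate CadQuery script
shape = namespace["result"]        # assumed output-variable convention

exporters.export(shape, "candidate.stl")   # real geometry export, not text matching

mesh = trimesh.load("candidate.stl")
print("watertight:", mesh.is_watertight)
print("components:", len(mesh.split(only_watertight=False)))  # disconnected-parts check
```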
The curriculum adapts to model failures: build errors and weak semantics become the next tasks the model must learn to repair.