OpenEnv + CadQuery + GRPO

CADForge RLVE

Frontier models can describe CAD, but tiny models often fail to produce executable, editable CadQuery code. CADForge turns that gap into an RL environment: write code, compile real geometry, receive reward, repair, and improve.

Buildable CAD preview

Choose a task. CADForge will score a weak seed, repair it, verify it, and render the improved STL.


Judge rerun links

The full CADForge SFT and GRPO runs were executed on a RunPod H200 as distinct production scripts. The Colab notebook is the public smoke path: it validates OpenEnv, loads the public dataset, runs the real CadQuery reward backend, and launches tiny SFT/GRPO checks with the same source files.

The environment fights back

CADForge is not a static benchmark. The first dense GRPO run exposed a reward flaw: the model could receive partial reward while still failing to build. The environment adapted. Buildability became the first gate, failed code became negative reward, and each failure type became a curriculum target.

1. Observe failure

SyntaxError, missing fixture, invented API, disconnected parts, clipped final union, weak semantic match.

2. Generate curriculum

Failed trajectories become new repair tasks: fix one concrete CAD failure and improve the reward delta.

3. Train harder rollouts

GRPO groups compare buildable vs broken candidates, giving the model clean advantage signals.
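The three-step loop above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the failure labels mirror the list in step 1, but the trajectory dict fields, the `RepairTask` schema, and the advantage normalization constant are assumptions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical failure labels mirroring the observed failure modes above.
FAILURE_TYPES = (
    "syntax_error", "missing_fixture", "invented_api",
    "disconnected_parts", "clipped_union", "weak_semantic_match",
)

@dataclass
class RepairTask:
    """A failed trajectory recast as a concrete repair task."""
    failure_type: str
    broken_code: str
    baseline_reward: float  # the reward the repair must improve on

def mine_curriculum(trajectories):
    """Steps 1 and 2: observe failures, emit repair tasks."""
    return [
        RepairTask(t["failure_type"], t["code"], t["reward"])
        for t in trajectories
        if t["failure_type"] in FAILURE_TYPES
    ]

def grpo_advantages(group_rewards):
    """Step 3: group-relative advantages. Buildable candidates score
    above broken ones, so the gap becomes a clean learning signal."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + 1e-6) for r in group_rewards]
```

In a GRPO rollout, `grpo_advantages` would be applied per sampled group, so a single buildable candidate in a group of broken ones receives a strongly positive advantage.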

Real training evidence

We ran seven distinct training experiments on a RunPod H200. The important story is not just that loss went down; it is that the environment exposed reward hacking, and build-gated GRPO plus adaptive repair then separated buildable CAD from broken code.

| Run | What happened | Lesson |
| --- | --- | --- |
| 1. Qwen3.5-2B SFT | train loss 1.4480 to 0.1658; eval loss 0.4477 to 0.2676 | 2B learned CadQuery grammar and trace format. |
| 2. Qwen3.5-2B dense GRPO | 160 completions; 0.0% build rate; mean/best reward 0.3387 / 0.5303 | Reward was learnable, but too hackable without a hard build gate. |
| 3. Qwen3.5-9B SFT | train loss 2.6020 to 0.1413; eval loss 0.3650 to 0.2398 | 9B learned syntax and structure faster than 2B. |
| 4. Qwen3.5-9B dense GRPO | 160 completions; 0.0% build rate; mean/best reward 0.4355 / 0.6828 | Bigger model got higher scalar reward while still failing buildability. |
| 5. Qwen3.5-9B strict GRPO | 320 completions; 96 buildable; best CADForge score 0.9352 | Buildability-first reward produced the first real breakthrough. |
| 6. Adaptive repair v1 | 120 repair completions; 0 buildable; clipped-output pattern exposed | The environment found a curriculum/completion-length bug. |
| 7. Adaptive repair final 8192 | 180 repair completions; 53 buildable; 0 clipped completions; best reward 0.882 | Failure mining plus longer completions recovered buildable repairs. |
Best downloadable model adapter

Use the final Qwen3.5-9B adaptive-repair LoRA to test CadQuery generation and repair locally or on a GPU notebook.

Download best model

In the held-out eval after strict GRPO, 2 of 3 generated CadQuery files built successfully. The remaining chair case clipped before the final assembly, which directly motivated the adaptive repair run.

Reward hacking and reward design

CADForge started with dense rewards for code shape, semantic words, topology, contact, reference similarity, and editability. Training showed a classic reward-hacking pattern: models could earn positive-looking reward while still producing non-buildable CAD. The fix was to make buildability the first gate.

What was hackable

Dense reward gave partial credit for code-like text, named parts, and semantic hints even when CadQuery failed to export an STL.

What fixed it

Strict GRPO makes failed builds negative. Dense topology, semantic, contact, reference, and editability scores unlock only after the CAD builds.
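A minimal sketch of that build gate, assuming a boolean build flag and dense sub-scores on [0, 1]; the -1.0 failure penalty and the 50/50 weighting between build credit and dense scores are illustrative choices, not the project's exact formula.

```python
def cadforge_reward(build_ok, dense_scores, fail_penalty=-1.0):
    """Build-gated reward: failed builds earn a negative reward, and
    dense sub-scores unlock only after the CAD actually builds.

    dense_scores: dict of sub-scores in [0, 1], e.g.
    {"topology": 0.82, "contact": 0.74, "semantic_parts": 0.61,
     "reference_similarity": 0.58, "editability": 0.80}
    """
    if not build_ok:
        return fail_penalty            # hard gate: broken code is never rewarded
    if not dense_scores:
        return 1.0                     # buildable with no dense signal: base credit
    base = 1.0                         # credit for a successful build
    dense = sum(dense_scores.values()) / len(dense_scores)
    return 0.5 * base + 0.5 * dense    # illustrative weighting
```

Under this shape, no amount of code-like text or semantic hints can lift a non-buildable candidate above a buildable one, which is exactly the separation the strict GRPO run relied on.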

Step rewards

Each action returns reward JSON: build, topology, contact, semantic parts, reference similarity, editability, efficiency, and verifier notes.

{
  "build": 1.0,
  "topology": 0.82,
  "contact": 0.74,
  "semantic_parts": 0.61,
  "reference_similarity": 0.58,
  "editability": 0.80,
  "total": 0.86,
  "notes": ["candidate builds", "recognizable task parts", "clean fixture"]
}
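A small consumer sketch for this reward JSON, e.g. when logging rollouts or building SFT data. The required keys follow the example above; treating them as a mandatory schema is an assumption.

```python
import json

# Keys taken from the example reward JSON above; requiring all of them
# is an assumption about the schema.
REQUIRED = {"build", "topology", "contact", "semantic_parts",
            "reference_similarity", "editability", "total"}

def parse_step_reward(raw: str) -> dict:
    """Parse one step's reward JSON and sanity-check its schema."""
    obs = json.loads(raw)
    missing = REQUIRED - obs.keys()
    if missing:
        raise ValueError(f"reward JSON missing keys: {sorted(missing)}")
    return obs

raw = ('{"build": 1.0, "topology": 0.82, "contact": 0.74, '
       '"semantic_parts": 0.61, "reference_similarity": 0.58, '
       '"editability": 0.80, "total": 0.86, "notes": []}')
obs = parse_step_reward(raw)
```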

Space APIs

The Space is both a demo and an OpenEnv-style reward service. A model can submit CadQuery code, receive structured observations, and use those step rewards for SFT data generation, GRPO rollouts, or human-readable debugging.

GET /healthz: Health check for the CADForge Space.
POST /api/space/repair-loop: Runs the demo loop (weak seed, repaired CAD, CadQuery build, reward JSON, and STL artifact URLs).
POST /api/space/demo: Scores a known buildable candidate and returns reward dimensions plus artifact paths.
GET /api/space/loop-stl/{task_id}: Downloads the repaired STL from the most recent repair-loop run.
GET /api/space/loop-stl/{task_id}/{step_id}: Downloads a specific weak-seed or repaired-step STL for visual comparison.
OpenEnv step route: The OpenEnv server wraps complete CadQuery Python files as actions and returns observations with reward JSON and verifier notes.
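A client sketch against the routes above, using only the standard library. The base URL is a placeholder, and the `task_id` request field is an assumption about the repair-loop payload; only the paths come from the list above.

```python
import json
import urllib.request

BASE = "https://example-cadforge.hf.space"  # placeholder; use your Space URL

def repair_loop_request(task_id: str) -> urllib.request.Request:
    """Build the POST request for /api/space/repair-loop.
    The {"task_id": ...} payload shape is an assumption."""
    return urllib.request.Request(
        BASE + "/api/space/repair-loop",
        data=json.dumps({"task_id": task_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def loop_stl_url(task_id: str, step_id: str = "") -> str:
    """URL for the repaired STL, or a specific step's STL."""
    url = f"{BASE}/api/space/loop-stl/{task_id}"
    return f"{url}/{step_id}" if step_id else url
```

Sending the request (e.g. with `urllib.request.urlopen`) would return the reward JSON shown in the step-rewards section, which a training loop can consume directly.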

Theme alignment

Long-horizon planning

CAD is built through repeated code edits, reward observations, and repairs rather than one-shot text generation.

Professional world modeling

The agent interacts with real CadQuery execution, STL export, mesh checks, reference metrics, and persistent state.

Self-improvement

The curriculum adapts to model failures: build errors and weak semantics become the next tasks the model must learn to repair.
