OpenEnv + CadQuery + GRPO

CADForge RLVE

Frontier models can describe CAD, but tiny models often fail to produce executable, editable CadQuery code. CADForge turns that gap into an RL environment: write code, compile real geometry, receive reward, repair, and improve.

Buildable CAD preview

Choose a task. CADForge will score a weak seed, repair it, verify it, and render the improved STL.


Judge rerun links

The full CADForge SFT and GRPO runs were executed on a RunPod H200 as distinct production scripts. The Colab notebook is the public smoke path: it validates OpenEnv, loads the public dataset, runs the real CadQuery reward backend, and launches tiny SFT/GRPO checks with the same source files.

The environment fights back

CADForge is not a static benchmark. The first dense GRPO run exposed a reward flaw: the model could receive partial reward while still failing to build. The environment adapted. Buildability became the first gate, failed code became negative reward, and each failure type became a curriculum target.

1. Observe failure

SyntaxError, missing fixture, invented API, disconnected parts, clipped final union, weak semantic match.

2. Generate curriculum

Failed trajectories become new repair tasks: fix one concrete CAD failure and improve the reward delta.

3. Train harder rollouts

GRPO groups compare buildable vs broken candidates, giving the model clean advantage signals.
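The three-step loop above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the failure labels mirror the list in step 1, but the trajectory dict fields, the `RepairTask` schema, and the advantage normalization constant are assumptions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical failure labels mirroring the observed failure modes above.
FAILURE_TYPES = (
    "syntax_error", "missing_fixture", "invented_api",
    "disconnected_parts", "clipped_union", "weak_semantic_match",
)

@dataclass
class RepairTask:
    """A failed trajectory recast as a concrete repair task."""
    failure_type: str
    broken_code: str
    baseline_reward: float  # the reward the repair must improve on

def mine_curriculum(trajectories):
    """Steps 1 and 2: observe failures, emit repair tasks."""
    return [
        RepairTask(t["failure_type"], t["code"], t["reward"])
        for t in trajectories
        if t["failure_type"] in FAILURE_TYPES
    ]

def grpo_advantages(group_rewards):
    """Step 3: group-relative advantages. Buildable candidates score
    above broken ones, so the gap becomes a clean learning signal."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + 1e-6) for r in group_rewards]
```

In a GRPO rollout, `grpo_advantages` would be applied per sampled group, so a single buildable candidate in a group of broken ones receives a strongly positive advantage.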

Real training evidence

We ran seven distinct training experiments on a RunPod H200. The important story is not just that loss went down; it is that the environment exposed reward hacking, and build-gated GRPO plus adaptive repair then separated buildable CAD from broken code.

| Run | What happened | Lesson |
| --- | --- | --- |
| 1. Qwen3.5-2B SFT | train loss 1.4480 to 0.1658; eval loss 0.4477 to 0.2676 | 2B learned CadQuery grammar and trace format. |
| 2. Qwen3.5-2B dense GRPO | 160 completions; 0.0% build rate; mean/best reward 0.3387 / 0.5303 | Reward was learnable, but too hackable without a hard build gate. |
| 3. Qwen3.5-9B SFT | train loss 2.6020 to 0.1413; eval loss 0.3650 to 0.2398 | 9B learned syntax and structure faster than 2B. |
| 4. Qwen3.5-9B dense GRPO | 160 completions; 0.0% build rate; mean/best reward 0.4355 / 0.6828 | Bigger model got higher scalar reward while still failing buildability. |
| 5. Qwen3.5-9B strict GRPO | 320 completions; 96 buildable; best CADForge score 0.9352 | Buildability-first reward produced the first real breakthrough. |
| 6. Adaptive repair v1 | 120 repair completions; 0 buildable; clipped-output pattern exposed | The environment found a curriculum/completion-length bug. |
| 7. Adaptive repair final 8192 | 180 repair completions; 53 buildable; 0 clipped completions; best reward 0.882 | Failure mining plus longer completions recovered buildable repairs. |
Best downloadable model adapter

Use the final Qwen3.5-9B adaptive-repair LoRA to test CadQuery generation and repair locally or on a GPU notebook.

Download best model

In the held-out eval after strict GRPO, 2 of 3 generated CadQuery files built successfully. The remaining chair case clipped before the final assembly, which directly motivated the adaptive repair run.

Reward hacking and reward design

CADForge started with dense rewards for code shape, semantic words, topology, contact, reference similarity, and editability. Training showed a classic reward-hacking pattern: models could earn positive-looking reward while still producing non-buildable CAD. The fix was to make buildability the first gate.

What was hackable

Dense reward gave partial credit for code-like text, named parts, and semantic hints even when CadQuery failed to export an STL.

What fixed it

Strict GRPO makes failed builds negative. Dense topology, semantic, contact, reference, and editability scores unlock only after the CAD builds.
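A minimal sketch of that build gate, assuming a boolean build flag and dense sub-scores on [0, 1]; the -1.0 failure penalty and the 50/50 weighting between build credit and dense scores are illustrative choices, not the project's exact formula.

```python
def cadforge_reward(build_ok, dense_scores, fail_penalty=-1.0):
    """Build-gated reward: failed builds earn a negative reward, and
    dense sub-scores unlock only after the CAD actually builds.

    dense_scores: dict of sub-scores in [0, 1], e.g.
    {"topology": 0.82, "contact": 0.74, "semantic_parts": 0.61,
     "reference_similarity": 0.58, "editability": 0.80}
    """
    if not build_ok:
        return fail_penalty            # hard gate: broken code is never rewarded
    if not dense_scores:
        return 1.0                     # buildable with no dense signal: base credit
    base = 1.0                         # credit for a successful build
    dense = sum(dense_scores.values()) / len(dense_scores)
    return 0.5 * base + 0.5 * dense    # illustrative weighting
```

Under this shape, no amount of code-like text or semantic hints can lift a non-buildable candidate above a buildable one, which is exactly the separation the strict GRPO run relied on.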

Step rewards

Each action returns reward JSON: build, topology, contact, semantic parts, reference similarity, editability, efficiency, and verifier notes.

{
  "build": 1.0,
  "topology": 0.82,
  "contact": 0.74,
  "semantic_parts": 0.61,
  "reference_similarity": 0.58,
  "editability": 0.80,
  "total": 0.86,
  "notes": ["candidate builds", "recognizable task parts", "clean fixture"]
}
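A small consumer sketch for this reward JSON, e.g. when logging rollouts or building SFT data. The required keys follow the example above; treating them as a mandatory schema is an assumption.

```python
import json

# Keys taken from the example reward JSON above; requiring all of them
# is an assumption about the schema.
REQUIRED = {"build", "topology", "contact", "semantic_parts",
            "reference_similarity", "editability", "total"}

def parse_step_reward(raw: str) -> dict:
    """Parse one step's reward JSON and sanity-check its schema."""
    obs = json.loads(raw)
    missing = REQUIRED - obs.keys()
    if missing:
        raise ValueError(f"reward JSON missing keys: {sorted(missing)}")
    return obs

raw = ('{"build": 1.0, "topology": 0.82, "contact": 0.74, '
       '"semantic_parts": 0.61, "reference_similarity": 0.58, '
       '"editability": 0.80, "total": 0.86, "notes": []}')
obs = parse_step_reward(raw)
```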

Space APIs

The Space is both a demo and an OpenEnv-style reward service. A model can submit CadQuery code, receive structured observations, and use those step rewards for SFT data generation, GRPO rollouts, or human-readable debugging.

GET /healthz: Health check for the CADForge Space.
POST /api/space/repair-loop: Runs the demo loop (weak seed, repaired CAD, CadQuery build, reward JSON, and STL artifact URLs).
POST /api/space/demo: Scores a known buildable candidate and returns reward dimensions plus artifact paths.
GET /api/space/loop-stl/{task_id}: Downloads the repaired STL from the most recent repair-loop run.
GET /api/space/loop-stl/{task_id}/{step_id}: Downloads a specific weak-seed or repaired-step STL for visual comparison.
OpenEnv step route: The OpenEnv server wraps complete CadQuery Python files as actions and returns observations with reward JSON and verifier notes.
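A client sketch against the routes above, using only the standard library. The base URL is a placeholder, and the `task_id` request field is an assumption about the repair-loop payload; only the paths come from the list above.

```python
import json
import urllib.request

BASE = "https://example-cadforge.hf.space"  # placeholder; use your Space URL

def repair_loop_request(task_id: str) -> urllib.request.Request:
    """Build the POST request for /api/space/repair-loop.
    The {"task_id": ...} payload shape is an assumption."""
    return urllib.request.Request(
        BASE + "/api/space/repair-loop",
        data=json.dumps({"task_id": task_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def loop_stl_url(task_id: str, step_id: str = "") -> str:
    """URL for the repaired STL, or a specific step's STL."""
    url = f"{BASE}/api/space/loop-stl/{task_id}"
    return f"{url}/{step_id}" if step_id else url
```

Sending the request (e.g. with `urllib.request.urlopen`) would return the reward JSON shown in the step-rewards section, which a training loop can consume directly.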

Theme alignment

Long-horizon planning

CAD is built through repeated code edits, reward observations, and repairs rather than one-shot text generation.

Professional world modeling

The agent interacts with real CadQuery execution, STL export, mesh checks, reference metrics, and persistent state.

Self-improvement

The curriculum adapts to model failures: build errors and weak semantics become the next tasks the model must learn to repair.
