LeWorldModel on Apple Silicon: TwoRoom Benchmark on Mac mini M4

For years, the practical frontier of AI research has felt like a gated community. Many papers are developed, trained, and validated inside Linux/CUDA environments backed by powerful NVIDIA GPUs, custom dependency stacks, and lab infrastructure that most developers do not have at home. The result is a reproducibility gap: even when a paper is public and the code is available, verifying the result locally can still be difficult.

I wanted to test how wide that gap really is for a modern world model. The target was LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels, usually abbreviated as LeWM. The original code lives at github.com/lucas-maes/le-wm, and this Apple Silicon adaptation lives in my fork at github.com/carlosap78/le-wm.

The question was not whether a Mac mini M4 can replace a research workstation. It cannot, and that is not the point. The question was more practical: can a small consumer desktop with an Apple M4 chip, 10 CPU cores, a 10-core integrated GPU, and 16 GB of memory run a real LeWorldModel evaluation and reproduce the paper’s TwoRoom success rate closely enough to be meaningful?

The answer was yes, with caveats. On the paper-like TwoRoom evaluation, the original paper reports an 87% LeWM success rate. The local Mac mini M4 run reached 86%, solving 43 out of 50 evaluation cases.

Why LeWorldModel Is Interesting

LeWorldModel is a world model built around a Joint-Embedding Predictive Architecture, or JEPA. That distinction matters because LeWM is not trying to generate a perfect pixel-by-pixel reconstruction of the future. It learns a compact latent representation of the environment and predicts how that representation changes under candidate actions.

That makes the model useful for planning. Instead of blindly trying actions in the real environment, an agent can use the model to evaluate possible futures internally. A planner can then search over candidate action sequences and select the one most likely to move the agent toward the goal.

In plain terms, the model is not asking, “What should every future pixel look like?” It is asking, “What future state does this action sequence lead to, and is that state closer to the goal?”

That is closer to how humans often reason about movement. When you walk through a room, you are not rendering a perfect 4K movie in your head. You are tracking where you are, where the obstacles are, and which actions are likely to get you where you want to go.

The TwoRoom Reproducibility Test

The benchmark I focused on was TwoRoom. In this environment, an agent must navigate a layout with two rooms and a connecting passage to reach a target. The evaluation is more meaningful than simply checking whether the model loads. It requires the checkpoint, dataset, environment, world model, and planner to work together.

The local run used the converted TwoRoom checkpoint and the default evaluation settings from this fork:

python eval.py --config-name=tworoom.yaml \
  policy=tworoom/lewm \
  +device=mps \
  solver.device=mps

The evaluation ran 50 cases with CEM planning. The key settings were 300 solver samples, 30 optimization steps, top-k 30, a planning horizon of 5, and an evaluation budget of 50.
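For readers unfamiliar with CEM (the cross-entropy method), the loop behind those settings can be sketched in a few lines of NumPy. This is a minimal illustration, not the planner used by the fork: the names cem_plan and toy_rollout_cost are hypothetical, the action dimension of 2 is an assumption, and the toy cost (distance from a 2D point to a goal after summing displacements) stands in for rolling the learned world model forward in latent space. Only the sample count, step count, top-k, and horizon mirror the settings above.

```python
import numpy as np

GOAL = np.array([2.0, -1.0])   # hypothetical goal for the toy problem
START = np.zeros(2)            # hypothetical start state


def toy_rollout_cost(candidates):
    """Stand-in for the world model: each action is a 2D displacement.

    candidates has shape (num_samples, horizon, action_dim).
    Returns one cost per candidate: distance of the final state to GOAL.
    """
    final_state = START + candidates.sum(axis=1)
    return np.linalg.norm(final_state - GOAL, axis=-1)


def cem_plan(cost_fn, horizon=5, action_dim=2,
             num_samples=300, n_steps=30, topk=30, seed=0):
    """Cross-entropy method over action sequences with a diagonal Gaussian."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_steps):
        # Sample candidate action sequences around the current distribution.
        noise = rng.standard_normal((num_samples, horizon, action_dim))
        candidates = mean + std * noise
        # Score every candidate and keep the top-k lowest-cost sequences.
        costs = cost_fn(candidates)
        elite = candidates[np.argsort(costs)[:topk]]
        # Refit the sampling distribution to the elite set.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


plan = cem_plan(toy_rollout_cost)
final_error = np.linalg.norm(START + plan.sum(axis=0) - GOAL)
print(f"planned final state is {final_error:.4f} away from the goal")
```

In a real planner the cost function would presumably roll the candidate actions through the learned predictor and compare the predicted latent state with the goal's embedding; the sampling and refitting loop keeps the same shape.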

The result was the central finding of the experiment:

Source	TwoRoom LeWM success rate
Paper Figure 6	87%
Local Mac mini M4 run	86%

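The local machine solved 43 out of 50 cases. That one-percentage-point gap is the important part, although it should be read with the sample size in mind: with 50 cases, each episode is worth two percentage points, so 86% and 88% are the nearest values this evaluation can produce to the paper’s 87%. The result suggests that, at least for this benchmark and this pretrained checkpoint, the model’s decision-making behavior is portable across a very different hardware and software stack.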
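To put a rough uncertainty on a 43-out-of-50 result, one option is a Wilson score interval. The sketch below is not part of the fork or the paper; it assumes the 50 cases can be treated as independent trials with a common success probability, which is a simplification, and the function name wilson_interval is my own.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(43, 50)
print(f"95% interval for 43/50: {low:.3f} to {high:.3f}")
```

Under those assumptions the interval runs from roughly 74% to 93%, so the paper's 87% sits comfortably inside it. The same width is also a reminder that 50 cases cannot resolve differences of a few percentage points.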

The Apple Silicon Work

The difficult part was not the model concept. It was the practical engineering required to move research code from a CUDA-first environment onto macOS with Metal/MPS.

The fork keeps the core LeWorldModel code intact, but it adapts the runtime assumptions that prevented the evaluation from running cleanly on Apple Silicon:

dynamic device selection for cuda, mps, or cpu;
planner configuration that can target MPS instead of CUDA;
macOS-safe handling instead of forcing Linux-oriented rendering defaults such as MUJOCO_GL=egl;
compatibility with the installed stable-worldmodel API, including World.evaluate_from_dataset(...);
MPS-safe floating-point handling in jepa.py, because PyTorch’s MPS backend does not support float64 tensors.

That last detail is easy to underestimate. A tensor that works on CPU or CUDA can fail on MPS if it is double precision. On Apple Silicon, floating-point tensors need to be cast to float32 before they are moved to the device. This kind of small compatibility issue is often the real work of reproducibility: not inventing a new model, but removing enough hidden assumptions that the original result can run somewhere else.
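The device selection and float32 handling described above can be sketched as follows. This is an illustration of the pattern, not the code in the fork: select_device and to_device_safe are hypothetical helper names, and the exact place where jepa.py performs the cast may differ. The PyTorch calls themselves (torch.cuda.is_available, torch.backends.mps.is_available, Tensor.to) are standard.

```python
import torch


def select_device(preferred=None):
    """Pick an explicit device if given, else cuda, then mps, then cpu."""
    if preferred is not None:
        return torch.device(preferred)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def to_device_safe(tensor, device):
    """Move a tensor to the device, downcasting float64 for MPS only."""
    device = torch.device(device)
    if device.type == "mps" and tensor.dtype == torch.float64:
        # The MPS backend rejects float64, so cast before the transfer.
        tensor = tensor.to(torch.float32)
    return tensor.to(device)


device = select_device()
observation = torch.rand(4, 3, dtype=torch.float64)
on_device = to_device_safe(observation, device)
print(device, on_device.dtype)
```

Casting only when the target is MPS keeps CPU and CUDA behavior unchanged, which is one way to avoid altering results on the platforms the original code was written for.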

The fork also adds practical tooling around the experiment: a Hugging Face checkpoint conversion script, a smoke test for MPS/CUDA/CPU, benchmark presets, JSON/log output, and a self-contained HTML report.

Accuracy Is Portable; Runtime Needs Context

The Mac mini M4 run took 831.0511 seconds, or about 13.85 minutes. Most of that time was spent in CEM planning, with 803.306 seconds of total CEM time.
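The breakdown is easy to check with plain arithmetic. The snippet below uses only the two timings and the case count reported in this article; the per-case average is derived by dividing the total by 50 and is not a separately measured figure.

```python
# Timings reported for the local Mac mini M4 run.
TOTAL_SECONDS = 831.0511
CEM_SECONDS = 803.306
NUM_CASES = 50

total_minutes = TOTAL_SECONDS / 60
cem_share = CEM_SECONDS / TOTAL_SECONDS
non_cem_seconds = TOTAL_SECONDS - CEM_SECONDS
avg_seconds_per_case = TOTAL_SECONDS / NUM_CASES  # derived, not measured

print(f"total: {total_minutes:.2f} min")
print(f"CEM share of runtime: {cem_share:.1%}")
print(f"time outside CEM: {non_cem_seconds:.1f} s")
print(f"average per case: {avg_seconds_per_case:.1f} s")
```

Planning accounts for nearly all of the wall-clock time, so changes to the solver settings are the most direct lever on runtime.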

That runtime should not be framed as a direct speed comparison against the paper. The hardware is different, the backend is different, the dependency versions may differ, and the exact runtime path is not necessarily identical to the authors’ setup. The paper discusses LeWM’s planning efficiency under its own experimental conditions, but this local run is best understood as a reproducibility exercise rather than a hardware benchmark.

The stronger claim is about accuracy: a consumer Apple Silicon machine reproduced the paper’s LeWM TwoRoom success rate to within one percentage point using a pretrained checkpoint. That is a meaningful result for anyone who wants to inspect, modify, and understand a model locally instead of only reading about it.

What This Does and Does Not Prove

This experiment does not prove that a Mac mini M4 can train LeWorldModel from scratch at research-lab scale. It also does not reproduce every environment in the paper. The scope here is narrower and more concrete: a paper-like TwoRoom evaluation using the pretrained LeWM checkpoint.

Within that scope, the result is encouraging. The model ran locally, the planner worked, the environment executed, and the success rate landed very close to the published number. That is exactly the kind of practical reproducibility result that helps turn a paper from a static PDF into something a developer can study and extend.

The broader lesson is that local AI work is not only about raw speed. It is about access. A run that takes 14 minutes on a small desktop machine is still valuable if it lets you inspect the code path, change the planner, test the checkpoint, and understand the system without needing a remote research cluster.

Conclusion

The Mac mini M4 did not “beat” a research lab, and it does not need to. What it did was more useful: it reproduced a meaningful LeWorldModel result on consumer Apple Silicon with a small set of targeted engineering changes.

LeWorldModel reported 87% success on TwoRoom in the paper. This fork reached 86% locally on a Mac mini M4. That is close enough to make the experiment worth taking seriously, and it shows that the distance between lab research and desktop reproducibility is shrinking.

Open papers, open code, and open checkpoints are only the beginning. The next step is making those results runnable on the machines developers actually use. This fork is one example of that work: moving LeWorldModel out of a CUDA-first research context and onto a small Apple Silicon machine, while keeping the original paper and implementation at the center of the story.

References

Paper: LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Original GitHub repository: github.com/lucas-maes/le-wm
Apple Silicon fork: github.com/carlosap78/le-wm