Beyond LLMs: Yann LeCun’s Critique and the JEPA Research Program

Yann LeCun is not saying that LLMs are useless. That would be too simple. His argument is more precise: language models are very good at manipulating language, code, and other symbolic systems, but they do not appear to be a sufficient path toward intelligence that can act reliably in the physical world.

The distinction matters. A chatbot that writes, summarizes, or codes mostly lives in the space of symbols. A robot, an industrial control system, an autonomous vehicle, or a medical assistant that intervenes in real processes needs something more: it must be able to predict what will happen if it acts.

The Core Thesis

LeCun argues that robust intelligence requires at least two capabilities:

Predicting the consequences of actions.
Planning through search or optimization.

An LLM predicts the next token. It may appear to reason, plan, or understand, but its base mechanism is still autoregressive symbol generation. For LeCun, that is not the same as having an internal model of the world.
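To make "autoregressive symbol generation" concrete, here is a minimal sketch that uses a toy bigram counter as a stand-in for an LLM. The corpus and the names fit_bigram and generate are illustrative assumptions. The point is only the shape of the loop: sample a token, append it, condition on it, repeat.

```python
import random
from collections import Counter, defaultdict

def fit_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_new_tokens, rng):
    """Autoregressive loop: sample a token, append it, condition on it, repeat."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:  # no known continuation: stop early
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return out

# Hypothetical toy corpus, used only for illustration.
corpus = "the bottle tips over and the bottle falls off the table".split()
model = fit_bigram(corpus)
sample = generate(model, "the", 8, random.Random(0))
print(" ".join(sample))
```

Nothing in this loop represents a state of the world or the effect of an action. A real LLM replaces the count table with a large neural network, but the generation procedure has the same structure.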

This critique does not deny the progress of LLMs. In fact, it helps explain why they work so well in some domains: language, mathematics, and programming are already encoded as discrete sequences. Code also gives us compilers, tests, and verification. The physical world does not offer that convenience.

Why the Physical World Is Harder Than Language

Reality is continuous, noisy, partially observed, and high-dimensional. Predicting every pixel of a future video is usually the wrong target: most of that detail is irrelevant for deciding what to do.

If I push a bottle near the top, the important thing is not predicting every reflection on the plastic. The important thing is anticipating that the bottle will probably fall.

This is where JEPA enters.

What Is JEPA?

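JEPA stands for Joint Embedding Predictive Architecture. The idea is to learn prediction in representation space, not in the raw space of pixels, audio, or text. Concretely, the model encodes a context (for example, the visible part of an image) and a target (for example, a masked region or a later video frame), and it is trained to predict the target's representation from the context's representation.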

Instead of reconstructing an image pixel by pixel, a JEPA model learns latent representations that are meant to capture objects, positions, relations, motion, affordances, and other factors useful for understanding what is happening.
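The following sketch shows where the prediction error is measured in a JEPA-style setup. It is a forward pass only, with untrained random linear maps standing in for the encoders and the predictor. The sizes, weight names, and the jepa_loss function are assumptions made for illustration and are not taken from any published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 1024, 16            # raw input size vs. latent size (both arbitrary)

# Untrained toy weights; a real model would use deep networks.
W_ctx = rng.normal(size=(D, d)) / np.sqrt(D)   # context encoder
W_tgt = rng.normal(size=(D, d)) / np.sqrt(D)   # target encoder
W_pred = rng.normal(size=(d, d)) / np.sqrt(d)  # predictor

def jepa_loss(x_context, y_target):
    """Mean squared error between predicted and actual target embeddings."""
    s_x = np.tanh(x_context @ W_ctx)     # embed the context
    s_y = np.tanh(y_target @ W_tgt)      # embed the target
    s_y_hat = s_x @ W_pred               # predict the target embedding
    return float(np.mean((s_y_hat - s_y) ** 2)), s_y_hat.shape

x = rng.normal(size=(8, D))              # batch of 8 context inputs
y = x + 0.1 * rng.normal(size=(8, D))    # targets: noisy variants of the contexts

loss, pred_shape = jepa_loss(x, y)
print(pred_shape)  # (8, 16): the prediction lives in latent space
print(loss)
```

A pixel-reconstruction objective would compare 1024 numbers per example here. The latent objective compares 16, so the encoders are free to discard detail that does not help prediction.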

Supporting evidence comes from several lines of research:

I-JEPA demonstrates self-supervised image learning through representation prediction.
V-JEPA extends that logic to video.
DINO, DINOv2, BYOL, and VICReg show that self-supervised learning can produce useful visual representations without traditional human labels.
LeJEPA/SIGReg addresses representational collapse through explicit regularization.
LeWorldModel connects JEPA-style representation learning with world models used for planning and control.

The Collapse Problem

Joint embedding methods have a trivial failure mode: the model can assign the same representation to every input. If everything is represented the same way, prediction becomes easy, but the system learns nothing.

This is called representational collapse.
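A small numerical illustration, assuming an identity predictor and hypothetical toy encoders: a constant encoder scores a perfect prediction loss while carrying no information about its input.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))                  # 32 context inputs
y = x + 0.1 * rng.normal(size=(32, 10))        # 32 related target inputs

def collapsed_encoder(inputs):
    """Degenerate encoder: ignores its input and returns one fixed vector."""
    return np.ones((inputs.shape[0], 4))

W = rng.normal(size=(10, 4))                   # random, untrained projection

def linear_encoder(inputs):
    """Non-degenerate encoder for contrast."""
    return inputs @ W

def prediction_loss(encoder):
    """Latent prediction error with an identity predictor."""
    return float(np.mean((encoder(x) - encoder(y)) ** 2))

def embedding_variance(encoder):
    """Average per-dimension variance of the embeddings across the batch."""
    return float(encoder(x).var(axis=0).mean())

print(prediction_loss(collapsed_encoder))     # 0.0: a "perfect" score
print(embedding_variance(collapsed_encoder))  # 0.0: no information about the input
print(prediction_loss(linear_encoder), embedding_variance(linear_encoder))
```

The non-degenerate encoder has a higher loss but nonzero variance. A training objective that looks only at the prediction loss would prefer the collapsed one.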

A large part of the JEPA program is about avoiding that collapse. Several families of solutions exist:

Contrastive learning: pull related representations together and push unrelated representations apart.
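BYOL- or DINO-style distillation: a student model predicts the outputs of a teacher model whose weights are a moving average of the student's own.
Explicit regularization such as VICReg: keep the variance of each embedding dimension above a threshold and decorrelate the dimensions to reduce redundancy.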
SIGReg, proposed in LeJEPA: push representations toward an isotropic Gaussian distribution.
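As one example from this list, the sketch below implements VICReg-style variance and covariance penalties in NumPy. It is a simplified illustration, not the official implementation. The threshold gamma, the constant eps, and the three synthetic batches are assumptions. The other families (contrastive, distillation, SIGReg) are not shown.

```python
import numpy as np

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge penalty on dimensions whose batch std falls below gamma."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_term(z):
    """Sum of squared off-diagonal covariances, scaled by embedding size."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 8))      # spread out, roughly decorrelated
collapsed = np.ones((256, 8))            # every input mapped to the same point
redundant = np.repeat(rng.normal(size=(256, 1)), 8, axis=1)  # 8 copies of one feature

for name, z in [("healthy", healthy), ("collapsed", collapsed), ("redundant", redundant)]:
    print(name, round(variance_term(z), 3), round(covariance_term(z), 3))
```

The variance term penalizes the collapsed batch, and the covariance term penalizes the redundant one, where every dimension carries the same feature. Adding both terms to a prediction loss removes the trivial constant solution from the set of minima.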

The important point is that JEPA is not just a philosophical intuition. It is a concrete research program about how to learn useful representations without reconstructing the entire world in unnecessary detail.

World Models

A JEPA model becomes more interesting when it is conditioned on actions.

An agent with a world model can do something like this:

Observe the current state.
Imagine several possible actions.
Predict the latent consequences of each action.
Evaluate those consequences with a cost function.
Choose the action that best satisfies the objective while respecting constraints.
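The steps above can be sketched as a one-step planner. This is a deliberately tiny example under strong assumptions: the "world model" is a hand-written function on a one-dimensional state rather than a learned latent dynamics model, and the names predict, cost, allowed, and plan_step are hypothetical.

```python
def predict(state, action):
    """Stand-in world model: the next state if `action` is taken."""
    return state + action

def cost(state, goal):
    """Task cost: distance to the goal (lower is better)."""
    return abs(goal - state)

def allowed(state, limit):
    """Safety constraint: predicted states beyond `limit` are rejected."""
    return state <= limit

def plan_step(state, goal, candidates, limit):
    """Pick the admissible action whose predicted outcome has the lowest cost."""
    best_action, best_cost = None, float("inf")
    for action in candidates:                 # imagine several possible actions
        imagined = predict(state, action)     # predict the consequence
        if not allowed(imagined, limit):      # enforce the constraint
            continue
        c = cost(imagined, goal)              # evaluate with the cost function
        if c < best_cost:
            best_action, best_cost = action, c
    return best_action

state, goal, limit = 0.0, 5.0, 4.0
candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
trajectory = [state]
for _ in range(6):
    action = plan_step(state, goal, candidates, limit)
    state = predict(state, action)   # toy setting: the environment equals the model
    trajectory.append(state)
print(trajectory)  # [0.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0]
```

The agent moves toward the goal at 5.0 but stops at 4.0, because every action that would cross the limit is rejected before it is executed. A realistic planner would search over multi-step action sequences and use a learned model, but the division into model, cost, and constraint is the same.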

That looks more like planning than autocomplete.

This is why LeCun talks about objective-driven AI: systems that act by optimizing explicit objectives over predictions of the world, not merely by generating the most likely next output.

The Teenager Learning to Drive

One of LeCun’s clearest arguments is the driving comparison.

A teenager can learn to drive with a few dozen hours of practice. Autonomous driving systems, by contrast, have consumed enormous amounts of driving data and still have not solved fully general Level 5 autonomy.

The conclusion is not that humans are born knowing how to drive. The conclusion is that humans arrive with rich prior world models: intuitive physics, object permanence, the intentions of other agents, spatial maps, social norms, and sensitivity to risk.

A system that only imitates demonstrations has to learn too much from data. A system with world models should be able to generalize from fewer examples.

Safety: A Strong Critique, Not a Settled Result

LeCun also argues that LLMs are intrinsically unsafe as agents. His argument is that:

they can hallucinate;
they do not natively predict the real-world consequences of their actions;
there is no hard guarantee that the objective the system actually pursues matches the prompt it was given.

His alternative is an agent with a world model, explicit objectives, and safety constraints built into the optimization process.

This is conceptually attractive, but it does not solve everything. An objective-driven system can still fail if its world model is wrong or if its cost function is badly specified. The difference is that the control points appear more explicit and inspectable.

What Is Proven and What Is Not

It is useful to separate three levels.

Well supported:

Self-supervised representation learning works very well in vision.
I-JEPA and V-JEPA are real implementations of the JEPA approach.
Serious techniques exist for preventing representational collapse.
Latent world models are a valid research direction for planning and control.

Plausible but still open:

Whether JEPA can scale as well as LLMs have.
Whether learned world models are enough for open-ended robotics.
Whether SIGReg becomes the dominant solution for preventing collapse.
Whether objective-driven AI is safer in complex real systems.

Speculative:

That LLMs will be replaced as the dominant paradigm.
That JEPA will lead to human-level intelligence.
That the paradigm shift will become obvious in the near term.

My Reading

LeCun’s critique is strongest when framed this way: LLMs are extraordinary for symbolic domains, but they are not a complete architecture for robust physical agency.

The JEPA program tries to fill that gap. It does not simply promise bigger models. It proposes a different way to learn: latent representations, abstract prediction, world models, and planning through objectives.

The important question is not whether LLMs "work" or "do not work." They work. The question is whether the next leap in AI will come from continuing to scale token prediction or from building systems that learn how the world changes when they act.

LeCun is betting on the second path.

References

Assran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture": https://arxiv.org/abs/2301.08243
Bardes et al., "Revisiting Feature Prediction for Learning Visual Representations from Video": https://arxiv.org/abs/2404.08471
Balestriero and LeCun, "LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics": https://arxiv.org/abs/2511.08544
Maes et al., "LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels": https://arxiv.org/abs/2603.19312
Ha and Schmidhuber, "World Models": https://arxiv.org/abs/1803.10122
Bardes, Ponce and LeCun, "VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning": https://arxiv.org/abs/2105.04906
Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" (DINO): https://arxiv.org/abs/2104.14294
Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision": https://arxiv.org/abs/2304.07193