As 2026 passes its midpoint, the autonomous driving industry’s biggest debate is no longer confined to research labs—it is increasingly being fought on production lines and in boardrooms. The question dividing automakers and tech suppliers is whether advanced driving should be built on Vision-Language-Action (VLA) models or on World Models, which attempt to simulate how the physical world evolves. Huawei and Geely have recently questioned VLA’s limits publicly, while Li Auto and XPeng are pushing ahead with mass production of VLA-based architectures.
Yet the argument over model design is also a struggle for standard-setting power in the “embodied intelligence” era, when cars may become platforms for broader robot intelligence rather than stand-alone mobility products. That contest is unfolding while industry finances deteriorate sharply. Automaker profit margins have reportedly fallen to 2.9% in early 2026, intensifying the pressure to find new growth engines.
At the heart of the VLA-versus-World-Model clash are two different claims about what “intelligence” should mean for autonomous systems.
VLA, or Vision-Language-Action, inserts language as a reasoning layer between perception and control. In this view, the system interprets what it sees through large language model (LLM) reasoning, then selects actions. XPeng’s General Intelligence Center head, Liu Xianming, summarizes it as a form of learning “what humans would do in this world.” In practice, the model first turns road conditions into a semantic representation of the scene, expressed as language-like reasoning, and then maps that understanding into driving trajectories.
World Models operate on a different principle: they focus less on what humans would do and more on predicting how the environment changes after an action. Rather than using language tokens to compress the scene, the model learns causal physical dynamics in continuous space. A typical example is a system that sees a ball bounce at the roadside and predicts that a child is likely to run out after it, an expectation built from learned causality. Waymo’s World Model, unveiled at CVPR 2026, is positioned as a simulation engine for dynamic road scenes, including the behavioral intentions of vehicles and pedestrians.
The two approaches also correspond to different technical emphases. VLA aims to represent the 3D world through 1D textual tokens to leverage linguistic reasoning. World Models instead attempt to simulate physical law and causality directly in state space, trading linguistic compression for continuous spatial modeling.
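The difference in interface can be made concrete with a deliberately tiny sketch. Everything in it is hypothetical and invented for illustration: the one-dimensional `State`, the 20 m and 5 m thresholds, the candidate accelerations, and the function names do not come from any vendor's system. The VLA-style path compresses the scene into a coarse text label and maps that label to an action; the World-Model-style path predicts future states for each candidate action and picks among the predicted outcomes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    ego_pos: float       # metres along the lane
    ego_speed: float     # metres per second
    obstacle_pos: float  # metres along the lane (static obstacle)


def describe(state: State) -> str:
    """VLA-style step 1: compress the scene into a language-like label."""
    gap = state.obstacle_pos - state.ego_pos
    return "obstacle near" if gap < 20.0 else "road clear"


def vla_policy(state: State) -> float:
    """VLA-style step 2: map the label to an acceleration (m/s^2)."""
    return -3.0 if describe(state) == "obstacle near" else 1.0


def step(state: State, accel: float, dt: float = 0.5) -> State:
    """World-Model-style dynamics: predict the next state after an action."""
    speed = max(0.0, state.ego_speed + accel * dt)
    return State(state.ego_pos + speed * dt, speed, state.obstacle_pos)


def world_model_policy(state: State,
                       candidates=(-3.0, 0.0, 1.0),
                       horizon: int = 6) -> float:
    """Roll each candidate forward; keep the largest acceleration whose
    predicted trajectory never comes within 5 m of the obstacle."""
    best = min(candidates)  # fall back to the hardest braking option
    for accel in sorted(candidates):
        s, safe = state, True
        for _ in range(horizon):
            s = step(s, accel)
            if s.obstacle_pos - s.ego_pos < 5.0:
                safe = False
                break
        if safe:
            best = accel
    return best


if __name__ == "__main__":
    scene = State(ego_pos=0.0, ego_speed=10.0, obstacle_pos=30.0)
    print("label:", describe(scene))
    print("VLA-style action:", vla_policy(scene))
    print("World-Model-style action:", world_model_policy(scene))
```

With the sample scene, the label-based rule sees a 30 m gap as clear and accelerates, while the rollout predicts that only braking keeps the 5 m margin over the next three seconds. That outcome is an artifact of the hand-picked thresholds, not evidence that one approach is better; the sketch only illustrates where each design places its reasoning.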
In the commercial race, those philosophical differences have hardened into factional camps—though the industry is beginning to converge.
Huawei is presented as a flagship World Model advocate. Its WEWA 1.0 architecture, first proposed in ADS4 in 2025, moves the “world engine” into the cloud to generate difficult cases and increase long-tail scenario density. On the vehicle side, Huawei deploys a World Behavior Model using multi-sensor full-modal perception. Experts estimate that by 2026, more than 80 vehicle models equipped with Huawei’s intelligent driving solution could be shipping, with cumulative installations reaching 3 million units by year-end and annual R&D spending exceeding 18 billion yuan.
Waymo follows a similar World Model trajectory. At CVPR 2026, it outlined a training framework that reuses a Genie 3 general-purpose World Model for pre-training, then injects Waymo’s autonomous driving data—reported as over 200 million miles—to teach the model to understand multiple autonomous driving sensors. A final stage fine-tunes and distills the system for tasks like long-sequence simulation and planning, enabling generation of multi-sensor simulation data (including images and point clouds) and long-tail scenario production such as flooding or tornado-like conditions.
Momenta, meanwhile, is building a fusion path—combining World Models with reinforcement learning. Its R7 Reinforcement Learning World Model, released at the April 2026 Beijing Auto Show, is designed as a physical AI foundation model that uses “golden” training segments derived from large-scale real mileage, closed-loop simulation inside the World Model, and reinforcement learning exploration in high-fidelity virtual environments.
In the VLA camp, Li Auto and XPeng are portrayed as accelerating toward deployment. Li Auto’s MindVLA-o1 uses a multimodal mixture-of-experts Transformer, with a 3D ViT encoder for reconstructing video into 3D representations and fusing that with LiDAR geometry. Its decision layer is described as an implicit World Model stacked on top of a language model, generating near-future previews in latent space and then producing vehicle-dynamics-compliant trajectories. XPeng’s second-generation VLA is framed as more radical: it removes a traditional language translation intermediate step to reduce latency. Tesla’s FSD V14 takes yet another route—expanding end-to-end model scale and adding xAI Grok capabilities as an interpretation layer to improve intent understanding and decision explainability.
Despite the rivalry, the binary battle is losing clarity. XPeng has stated explicitly that VLA and World Models are not mutually exclusive, and examples of cross-integration are appearing across companies: XPeng reportedly trains with both VLA and World Models; Li Auto has added prediction mechanisms into its VLA approach; and Xiaomi’s OneVL claims to unify VLA, World Models, and latent-space reasoning under one framework.
Meanwhile, the strategic landscape is widening beyond driving. The embodied intelligence and humanoid robot sector is drawing near-universal attention, with nearly 20 automakers—BYD, XPeng, GAC, Chery, Changan, Li Auto, SAIC and overseas players including Tesla, BMW, Hyundai, and Mercedes-Benz—reported to be investing, incubating, or self-developing.
The financial logic is straightforward: with intense competition and price wars, automakers need new growth points. The robot market narrative is also attractive. IDC forecasts that user spending on embodied intelligent robots in China could grow from about $1.4 billion in 2025 to $77 billion by 2030, a roughly 55-fold increase in five years.
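To put the IDC figures on an annual basis, the implied compound annual growth rate can be worked out directly. The short sketch below uses only the two endpoints quoted above; the `cagr` helper name is ours, and the assumption of smooth compounding over five years is a simplification rather than anything IDC states.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end / start) ** (1.0 / years) - 1.0


# Endpoints from the forecast cited above, in billions of US dollars.
rate = cagr(start=1.4, end=77.0, years=2030 - 2025)
print(f"Implied annual growth: {rate:.0%}")
```

The result is a little over 120% per year, meaning spending would have to more than double every year for five consecutive years to reach the 2030 figure.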
The enabling argument is reuse: perception hardware, decision algorithms, and execution control can translate from automotive contexts into robotics. Data and scenario advantages may be harder to replicate elsewhere—automakers already generate large-scale real-world interaction data and have factories and public service channels that can become robot deployment environments.
