Autonomous driving has advanced rapidly, transitioning from rule-based systems to deep neural networks. Yet end-to-end models still face significant limitations: they often lack world knowledge, struggle in rare or ambiguous scenarios, and provide minimal insight into their decision-making process. Large language models (LLMs), by contrast, excel at reasoning, contextual understanding, and interpreting complex instructions. However, LLM outputs are linguistic rather than executable, making integration with real vehicle control difficult. These gaps highlight the need for frameworks that combine multi-modal perception with structured, actionable decision outputs grounded in established driving logic. Addressing these challenges requires deeper research into aligning multi-modal reasoning with autonomous driving planners.
A research team from Shanghai Jiao Tong University, Shanghai AI Laboratory, Tsinghua University, and collaborating institutions has developed DriveMLM, a multi-modal large language model framework for closed-loop autonomous driving. The findings were published (DOI: 10.1007/s44267-025-00095-w) on 26 November 2025 in Visual Intelligence. DriveMLM integrates multi-view camera images, LiDAR point clouds, system messages, and user instructions to produce aligned behavioral planning states. These states plug directly into existing motion-planning modules, enabling real-time driving control while generating natural-language explanations of each decision.
DriveMLM tackles a core challenge in LLM-based driving: converting linguistic reasoning into reliable control behavior. The framework aligns LLM outputs with the behavioral planning states used in modular systems such as Apollo, covering both speed decisions (KEEP, ACCELERATE, DECELERATE, STOP) and path decisions (FOLLOW, LEFT_CHANGE, RIGHT_CHANGE, and others).
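The decision-state alignment described above can be pictured as a small, fixed vocabulary that the LLM must emit instead of free-form text. The sketch below is illustrative only: the enum values come from the decision states named in the article, while the `BehavioralPlan` dataclass and `to_planner_command` interface are hypothetical stand-ins for how such states might be handed to an Apollo-style modular planner.

```python
from enum import Enum
from dataclasses import dataclass

# Speed and path decision states as named in the article.
class SpeedDecision(Enum):
    KEEP = "KEEP"
    ACCELERATE = "ACCELERATE"
    DECELERATE = "DECELERATE"
    STOP = "STOP"

class PathDecision(Enum):
    FOLLOW = "FOLLOW"
    LEFT_CHANGE = "LEFT_CHANGE"
    RIGHT_CHANGE = "RIGHT_CHANGE"
    # The framework defines additional path states not listed in the article.

@dataclass
class BehavioralPlan:
    """Hypothetical container for one aligned LLM decision."""
    speed: SpeedDecision
    path: PathDecision
    explanation: str  # natural-language rationale emitted alongside the decision

def to_planner_command(plan: BehavioralPlan) -> dict:
    """Serialize a decision into the flat key/value form a modular motion
    planner could consume. The interface is illustrative, not Apollo's API."""
    return {"speed_state": plan.speed.value, "path_state": plan.path.value}
```

Constraining outputs to a closed set like this is what makes the LLM's reasoning directly executable: the downstream planner only ever sees a valid state pair, never free text.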
A specialized multi-modal tokenizer processes multi-view temporal images, LiDAR data, traffic rules, and user instructions into unified token embeddings. A multi-modal LLM then predicts the appropriate decision state and produces an accompanying explanation, ensuring interpretability.
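At a high level, the tokenizer's job is to flatten heterogeneous inputs into one sequence the LLM can attend over. The following is a minimal conceptual sketch, assuming toy stand-in encoders; the real system uses learned visual and LiDAR encoders producing embedding vectors, not tagged strings.

```python
# Conceptual sketch of the multi-modal token stream. Each embed_* function is a
# placeholder for a learned encoder; tokens are (modality_tag, content) pairs.

def embed_images(frames):
    """Multi-view temporal camera frames -> image tokens."""
    return [("IMG", f) for f in frames]

def embed_lidar(points):
    """LiDAR point-cloud features -> LiDAR tokens."""
    return [("LIDAR", p) for p in points]

def embed_text(text, tag):
    """System message or user instruction -> text tokens."""
    return [(tag, tok) for tok in text.split()]

def build_token_sequence(frames, lidar, system_msg, instruction):
    """Interleave all modalities into the single sequence the LLM consumes.
    Ordering here is an assumption for illustration."""
    return (embed_text(system_msg, "SYS")
            + embed_images(frames)
            + embed_lidar(lidar)
            + embed_text(instruction, "USR"))
```

The key point the sketch captures is that traffic rules, sensor data, and user commands all share one token space, so a single model can condition its decision state on all of them jointly.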
To support training, the team created a large-scale data engine that generated 280 hours of driving data across eight CARLA maps and 30 challenging scenarios, including rare safety-critical events. The pipeline automatically labels speed and path decisions and uses human refinements and GPT-based augmentation to produce rich explanatory annotations.
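Automatic labeling of the kind described here typically derives decision states from the simulator's ground-truth ego trajectory. The function below is a hedged sketch of such a rule for speed decisions only; the thresholds and logic are illustrative assumptions, not the paper's actual labeling pipeline.

```python
def label_speed_decision(v_prev: float, v_next: float,
                         stop_thresh: float = 0.1, delta: float = 0.5) -> str:
    """Heuristic auto-labeler: infer a speed decision from two consecutive
    ego speeds in m/s. Threshold values are illustrative placeholders."""
    if v_next < stop_thresh:
        return "STOP"          # vehicle has (nearly) come to rest
    if v_next - v_prev > delta:
        return "ACCELERATE"    # speed rising beyond the dead band
    if v_prev - v_next > delta:
        return "DECELERATE"    # speed falling beyond the dead band
    return "KEEP"              # speed roughly constant
```

Rules like this yield the raw decision labels cheaply at scale; the human refinement and GPT-based augmentation mentioned above then supply the richer natural-language explanations that simple heuristics cannot.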
In closed-loop evaluation on the CARLA Town05 Long benchmark, DriveMLM achieved a Driving Score of 76.1, outperforming the Apollo baseline by 4.7 points, and recorded the highest miles per intervention (MPI, 0.96) among all compared systems. DriveMLM also demonstrated strong open-loop decision accuracy, improved explanation quality, and robust performance under natural-language guidance, such as yielding to emergency vehicles or interpreting user commands like "overtake" under varying traffic conditions.
“Our study shows that LLMs, once aligned with structured decision states, can serve as powerful behavioral planners for autonomous vehicles,” the research team noted. “DriveMLM goes beyond rule-following. It understands complex scenes, reasons about motion, and explains its decisions in natural language—capabilities essential for safety and public trust. By combining perception, planning, and human instruction within a unified framework, DriveMLM offers a promising direction for next-generation autonomous driving systems.”
DriveMLM demonstrates how multi-modal LLMs can enhance transparency, flexibility, and safety in autonomous driving. Its plug-and-play design allows seamless integration into established systems such as Apollo or Autopilot, enabling improved decision-making without major architectural changes. The ability to interpret natural-language instructions expands possibilities for interactive driving assistance and personalized in-vehicle AI copilots. More broadly, DriveMLM highlights a path toward reasoning-driven autonomous systems capable of understanding complex environments, anticipating risks, and justifying their actions—key capabilities for deploying trustworthy AI in real transportation networks.
