This study presents KEPT, an AI system that helps self-driving cars predict their own short-term path more safely by combining video understanding with a memory of similar past scenes. Tested on the public nuScenes benchmark, KEPT cuts prediction errors and potential collisions compared with existing planning methods, while using a fast, lightweight retrieval module that is practical for real-time driving.
Researchers from Tongji University and their international collaborators have developed a new AI system that helps self-driving cars "remember" similar past situations before choosing how to move next. The method, called KEPT (Knowledge-Enhanced Prediction of Trajectories), allows a vision-language model to predict the car's short-term path directly from front-view camera video while consulting a large library of previous real-world driving clips.
"Short-horizon trajectory prediction is where many autonomous driving systems still struggle, especially in complex, busy scenes," said first author Yujin Wang from the School of Automotive Studies at Tongji University. "Our idea was to let a vision-language model not only look at the current frames, but also recall how similar scenes have unfolded before, and then plan a safe, feasible motion based on that prior experience."
To make this possible, the team first designed a new video encoder that turns short clips of consecutive driving frames into compact vectors that capture both spatial layout and motion cues. The encoder, called a temporal frequency-spatial fusion (TFSF) module, combines a fast-Fourier-transform-based frequency attention block with multi-scale features from a Swin Transformer and a lightweight temporal transformer over seven frames sampled at 2 Hz. This design helps the model focus on subtle motion changes and fine-grained scene structure that matter for near-term planning.
The TFSF encoder is trained in a self-supervised way, without manual labels. It learns to bring visually and dynamically similar clips closer in its embedding space and push dissimilar clips apart, using a contrastive loss with a memory queue and hard-negative mining. This produces robust clip-level embeddings that can be used directly for retrieval.
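As a rough illustration of the training objective described above, the sketch below computes an InfoNCE-style contrastive loss for a single clip embedding against a queue of negatives, with an optional top-k filter standing in for hard-negative mining. The function names, the temperature value, and the top-k rule are assumptions made for this example; the article does not specify the exact loss used in KEPT.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive, queue, temperature=0.07, top_k_hard=None):
    """InfoNCE-style loss for one query embedding (illustrative only).

    queue: negative embeddings kept from earlier batches (the memory queue).
    top_k_hard: if set, keep only the k negatives most similar to the query,
    a simple stand-in for hard-negative mining.
    """
    q = normalize(query)
    pos_sim = dot(q, normalize(positive))
    neg_sims = [dot(q, normalize(n)) for n in queue]
    if top_k_hard is not None:
        neg_sims = sorted(neg_sims, reverse=True)[:top_k_hard]
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Numerically stable negative log-softmax of the positive logit.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]
```

The loss shrinks as the positive clip moves closer to the query than the queued negatives, which is the "pull similar clips together, push dissimilar clips apart" behavior the paragraph describes.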
On top of these embeddings, the researchers built a scalable retrieval pipeline. All clips in a large driving corpus are encoded and stored in a vector database. At run time, the current 3-second camera sequence is embedded and first routed to a nearby cluster by k-means, then matched to its closest neighbors using a hierarchical navigable small-world (HNSW) index. The system retrieves a handful of most similar scenes and their ground-truth trajectories, which act as strong priors for the planner while keeping retrieval latency low.
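The two-step lookup can be sketched in a few lines. The example below assumes the k-means centroids have already been fitted, routes a query embedding to its nearest centroid, and then ranks only the clips in that cluster. For simplicity it uses an exact scan inside the cluster where the described system uses an HNSW index, and all names and data structures here are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vec, centroids):
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def build_clusters(embeddings, centroids):
    """Group clip ids by nearest centroid (centroids assumed pre-fitted by k-means)."""
    clusters = {i: [] for i in range(len(centroids))}
    for clip_id, vec in embeddings.items():
        clusters[nearest_centroid(vec, centroids)].append(clip_id)
    return clusters


def retrieve(query, embeddings, centroids, clusters, top_k=3):
    """Route the query to one cluster, then return its top_k closest clip ids."""
    cluster_id = nearest_centroid(query, centroids)
    candidates = clusters[cluster_id]
    # Exact scan inside the cluster; an approximate index such as HNSW
    # would replace this step in a large-scale deployment.
    ranked = sorted(candidates, key=lambda cid: sq_dist(query, embeddings[cid]))
    return ranked[:top_k]
```

Restricting the search to one cluster is what keeps the candidate set small; the trade-off is that a true neighbor sitting just across a cluster boundary can be missed.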
These retrieved trajectories are not used in a black-box way. Instead, KEPT injects them into a chain-of-thought style prompt for a vision-language model, alongside the current video frames and explicit safety and kinematic constraints. The model is guided to compare the new scene with the retrieved examples, reason about similarities and differences, and then output a 3-second ego trajectory that respects speed limits, smoothness, and collision avoidance requirements.
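To make the prompting step concrete, here is a minimal sketch of how retrieved trajectories and constraints might be assembled into a text prompt. The wording of the prompt, the waypoint format, and the 0.5-second waypoint spacing are assumptions for illustration, not the actual KEPT prompt; the video frames would be passed to the model separately.

```python
def format_trajectory(points):
    """Render (x, y) waypoints in metres as a compact string."""
    return " ".join(f"({x:.2f}, {y:.2f})" for x, y in points)


def build_prompt(exemplars, constraints, horizon_s=3.0, step_s=0.5):
    """Assemble a reasoning-style prompt from retrieved trajectories (hypothetical format)."""
    n_points = round(horizon_s / step_s)
    lines = [
        "You are planning the ego vehicle's motion from the attached front-view frames.",
        "Similar past scenes and the trajectories that were actually driven:",
    ]
    for i, trajectory in enumerate(exemplars, start=1):
        lines.append(f"Example {i}: {format_trajectory(trajectory)}")
    lines.append("Constraints:")
    lines.extend(f"- {c}" for c in constraints)
    lines.append(
        "First compare the current scene with the examples and note the differences. "
        f"Then output {n_points} waypoints covering the next {horizon_s:g} seconds."
    )
    return "\n".join(lines)
```

Keeping the exemplars and constraints as explicit text is what lets a reader inspect what the model was asked to reason over.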
To make a general-purpose vision-language backbone suitable for this task, the team introduces a triple-stage fine-tuning scheme. In the first stage, the model is fine-tuned on visual question-answering tasks about object category, size, and distance to strengthen spatial grounding. In the second stage, it learns to regress future trajectories from surround-view images and basic kinematics, with losses that penalize unsafe curvature and abrupt maneuvers. In the final stage, it predicts the full trajectory from only consecutive front-view frames, learning to align its language head with short-term temporal structure. All three stages use lightweight LoRA adapters to keep adaptation efficient.
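One way a loss can penalize unsafe curvature is sketched below: it estimates the curvature through each triple of consecutive waypoints and adds a squared penalty only when that curvature exceeds a limit. The Menger-curvature estimate, the hinge form, and the `max_curvature` value are assumptions chosen for this example; the article does not give the exact loss terms used in the second stage.

```python
import math


def menger_curvature(a, b, c):
    """Curvature (1/m) of the circle through three 2D points; 0 if degenerate."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    ab = math.hypot(b[0] - a[0], b[1] - a[1])
    bc = math.hypot(c[0] - b[0], c[1] - b[1])
    ca = math.hypot(a[0] - c[0], a[1] - c[1])
    denom = ab * bc * ca
    if denom == 0.0:
        return 0.0
    return 2.0 * abs(cross) / denom


def curvature_penalty(trajectory, max_curvature=0.2):
    """Mean squared excess curvature over consecutive waypoint triples."""
    triples = list(zip(trajectory, trajectory[1:], trajectory[2:]))
    if not triples:
        return 0.0
    excess = [max(0.0, menger_curvature(a, b, c) - max_curvature) for a, b, c in triples]
    return sum(e * e for e in excess) / len(triples)
```

A straight path incurs no penalty, while a tight turn is penalized in proportion to how far it exceeds the limit, so gentle steering is left untouched.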
The researchers evaluated KEPT on the widely used nuScenes benchmark, comparing it against both traditional end-to-end planning baselines and more recent vision-language-based planners. Across standard open-loop metrics, KEPT achieved the best overall performance, reducing prediction error while maintaining competitive or lower collision indicators. Ablation experiments further showed that each component (self-supervised TFSF pre-training, the clustered retrieval stack, the triple-stage fine-tuning, and using multiple retrieved exemplars) contributes measurably to the final accuracy and safety profile.
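For readers unfamiliar with open-loop metrics, the sketch below shows the two kinds of quantity involved: an average displacement error between predicted and ground-truth waypoints, and a simple collision flag. The collision check here treats obstacles as points with a fixed safety radius, which is a deliberate simplification and an assumption of this example; benchmark implementations typically use bounding boxes, and the function names are hypothetical.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance (m) between matching waypoints."""
    if len(predicted) != len(ground_truth):
        raise ValueError("trajectories must have the same number of waypoints")
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances)


def collides(predicted, obstacles_per_step, radius=1.0):
    """True if any waypoint comes within `radius` of an obstacle at the same step."""
    for (px, py), obstacles in zip(predicted, obstacles_per_step):
        for ox, oy in obstacles:
            if math.hypot(px - ox, py - oy) < radius:
                return True
    return False
```

Both quantities are computed against recorded data without letting the planner's output change the scene, which is what "open-loop" means in this context.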
"Vision-language models are powerful reasoners, but in driving they can easily hallucinate or ignore physical constraints if we just ask them to 'draw a path'," said corresponding author Prof. Bingzhao Gao. "By grounding the model in a bank of real trajectories and training it on metrics that directly reflect motion feasibility and collision risk, KEPT turns this reasoning ability into something much closer to an engineerable planning module."
Beyond benchmark scores, the work points to a broader design pattern for autonomous driving: instead of treating large models as end-to-end black boxes, they can be wrapped with retrieval, structured prompts, and physically meaningful objectives to provide more transparent, data-efficient, and safety-aware behavior. The authors note that KEPT currently focuses on short-horizon, open-loop evaluation on a single dataset and camera configuration, and that closed-loop testing, richer sensor inputs, and more diverse driving regions are key directions for future research.
The team envisions that similar knowledge-enhanced planners could eventually support not only automated vehicles, but also advanced driver-assistance systems that explain their recommendations to human drivers in everyday language. By combining retrieval, vision, and language, KEPT offers a concrete step toward autonomous systems that can both drive and justify how they drive.
