Stimulate Reasoning Abilities of LLM in Interactive Prediction and Planning

ABOUT THE PROJECT

At a glance

Large Language Models (LLMs) have demonstrated substantial capabilities in a range of applications within autonomous driving [1-11], including scene reasoning, trajectory prediction, and vehicle motion planning [1-4]. For instance, DriveLM [1] fine-tuned an LLM for perception, prediction, and planning tasks, while [5] leveraged LLMs to generate and explain the actions taken by autonomous vehicles. In trajectory prediction, [3] integrated an LLM as a component of its forecasting model to improve prediction accuracy, while [6] incorporated language features into its model to improve the interpretability of its predictions. Because they operate in natural language, LLMs offer strong interpretability; they also bring substantial embedded common knowledge and can absorb supplementary knowledge through prompts. However, few existing LLM-based trajectory prediction or planning works have used LLMs to analyze high-level interactions, such as yielding relations induced by traffic lights, signs, or crosswalks, precisely where LLMs are expected to be most capable.
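As a rough illustration of the prompt-based interaction analysis this project targets, the sketch below serializes a simplified scene (agent states plus traffic-control context) into a natural-language prompt and asks a chat-style LLM for pairwise yielding relations. The scene schema, the `query_llm` stand-in, and the prompt wording are all hypothetical assumptions for illustration, not the method of any cited work.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    agent_id: str
    kind: str        # e.g., "vehicle", "pedestrian"
    position: tuple  # (x, y) in meters, map frame
    speed: float     # m/s

def build_interaction_prompt(agents, traffic_context):
    """Serialize a simplified scene into a yielding-relation query.

    The schema and wording here are illustrative only.
    """
    lines = [
        "You are a traffic-reasoning assistant.",
        f"Traffic context: {traffic_context}",
        "Agents:",
    ]
    for a in agents:
        lines.append(
            f"- {a.agent_id} ({a.kind}) at {a.position}, moving at {a.speed:.1f} m/s"
        )
    lines.append(
        "List every pair (A yields to B) implied by right-of-way rules, "
        "with a one-sentence justification for each."
    )
    return "\n".join(lines)

def query_llm(prompt: str) -> str:
    # Placeholder: plug in any chat-completion API client here.
    raise NotImplementedError

agents = [
    Agent("car_1", "vehicle", (0.0, 0.0), 8.0),
    Agent("ped_1", "pedestrian", (12.0, 3.0), 1.4),
]
prompt = build_interaction_prompt(
    agents, "signalized crosswalk ahead; pedestrian signal is WALK"
)
print(prompt)
# answer = query_llm(prompt)  # e.g., "car_1 yields to ped_1 because ..."
```

The point of the sketch is that the yielding relation is extracted as an explicit, human-readable statement rather than left implicit in trajectory features, which is what makes the LLM's common-sense knowledge usable for downstream prediction and planning.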

LLM-based applications in autonomous driving must be trained on language-annotated driving datasets, an area of intense recent research [1, 4-5, 7-11]. Notably, DriveLM [1] enriched the nuScenes dataset with comprehensive language analysis. Similarly, DriveGPT4 [5] leveraged the language-annotated BDD-X dataset to generate question-answer pairs, thereby improving the interpretability of its action-planning module. Drive with LLMs [7] developed QA labels for synthetic scenarios and then fine-tuned an LLM for action planning. Moreover, DRAMA [11] annotated driving videos with imminent risks to the ego vehicle and the rationale behind them. Additional contributions include NuScenes-QA [8] and Rank2Tell [9], which provide language descriptions for the nuScenes dataset and for a collection of 116 self-gathered clips, respectively. More recently, [1] and [10] further provide driving-scene analyses of CARLA-simulated data. However, these datasets fall short of including highly interactive real-world scenes and high-level interaction analyses, and their overall size remains limited compared to state-of-the-art driving datasets. Consequently, it is unsurprising that models built on them exhibit limited interaction-analysis abilities.
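For concreteness, language driving datasets of the kind surveyed above typically store scene-grounded question-answer pairs. The minimal schema below is a hypothetical sketch of how an interaction-focused QA label might be represented; the field names are assumptions, not the format of any cited dataset.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class InteractionQA:
    scene_id: str          # key into the underlying driving dataset (assumed)
    timestamp: float       # seconds into the clip
    question: str
    answer: str
    interaction_type: str  # e.g., "yield", "merge", "overtake"

sample = InteractionQA(
    scene_id="scene-0103",
    timestamp=4.5,
    question="Which vehicle must yield at the unsignalized crosswalk?",
    answer="The ego vehicle yields to the pedestrian entering the crosswalk.",
    interaction_type="yield",
)

# Serialize to a JSON line, a common storage format for QA annotations.
print(json.dumps(asdict(sample)))
```

Labels of this kind make the missing ingredient explicit: to evaluate interaction reasoning, the answer must name the yielding relation and its cause, not merely describe the scene.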
 

Principal investigators

Masayoshi Tomizuka

Wei Zhan

Themes

Large Language Model, interactive behavior, reasoning