Abstract
Existing end-to-end autonomous driving models achieve low prediction accuracy because they consider neither the varying importance of different regions of the driving scene nor the relationships between semantic categories. To address this problem, inspired by the driver attention mechanism and existing end-to-end autonomous driving models, and accounting for the influence of dynamic scene changes and of the semantic and depth information of the driving scene on driving-behavior decisions, a multimodal driving-behavior prediction model based on the attention mechanism was developed. Taking consecutive RGB frames of the driving scene as input, the model accurately predicts the steering wheel angle and vehicle speed. First, semantic and depth images of the RGB frames were generated by a semantic segmentation model and a monocular depth estimation model, respectively. Second, to discard information irrelevant to driving decisions, an anthropomorphic attention mechanism grounded in neuroscience and spatial suppression theory was designed as an energy function to compute the importance of each region of the driving scene. Third, a graph attention network (GAT) was adopted to extract features from the semantic images, enabling the model to learn the relationships between the semantic categories most relevant to driving decisions. Fourth, the extracted RGB, semantic, and depth features were fused under the principle of preserving the RGB features, and a convolutional long short-term memory network (ConvLSTM) propagated the fused features across consecutive frames, so that the driving behavior corresponding to the next frame could be predicted. Finally, comparisons with state-of-the-art models, an ablation study, a generalization experiment, and feature-visualization experiments were conducted to validate the performance of the proposed model. The results show that the proposed model outperforms other driving-behavior prediction models, with a training loss of 0.021 2, a prediction accuracy of 86.97%, and a mean square error of 0.031 5. The ablation study indicates that the semantic and depth images of consecutive frames, the anthropomorphic attention mechanism, and the GAT for semantic feature extraction all help improve prediction performance. The generalization experiment shows that the model generalizes well, and the feature-visualization experiments demonstrate that the features on which the model relies to predict driving behavior are essentially the same as those attended to by experienced drivers.
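The abstract does not give the exact form of the energy function. As a minimal sketch, the snippet below assumes a SimAM-style, parameter-free formulation (Yang et al., 2021), in which neurons that differ most from their spatial neighbors receive the lowest energy and hence the highest importance, consistent with the spatial suppression theory the authors invoke; the constant `lam` is an illustrative regularizer, not a value from the paper.

```python
import torch

def energy_attention(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Parameter-free, energy-based spatial attention (SimAM-style sketch).

    x: feature map of shape (B, C, H, W). Neurons far from the channel-wise
    spatial mean get low energy, i.e. high importance, mimicking the
    spatial suppression effect observed in visual neuroscience.
    """
    _, _, h, w = x.shape
    n = h * w - 1
    mu = x.mean(dim=(2, 3), keepdim=True)          # per-channel spatial mean
    d = (x - mu).pow(2)                            # squared deviation per neuron
    v = d.sum(dim=(2, 3), keepdim=True) / n        # per-channel spatial variance
    e_inv = d / (4 * (v + lam)) + 0.5              # inverse of the minimal energy
    return x * torch.sigmoid(e_inv)                # reweight features by importance
```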
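How the semantic image is turned into a graph is likewise not specified in the abstract. The sketch below assumes one node per semantic class (road, vehicle, pedestrian, and so on) with pooled features, and implements a single-head graph attention layer following Velickovic et al. (2018); the adjacency matrix is assumed to include self-loops so every softmax row is well defined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Minimal single-head graph attention layer.

    Nodes are hypothesized to be semantic-class regions pooled from the
    segmentation map; this is an illustrative assumption, not a detail
    confirmed by the paper.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scorer

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, in_dim) node features; adj: (N, N) adjacency with self-loops
        z = self.W(h)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), 0.2)  # pairwise scores
        e = e.masked_fill(adj == 0, float('-inf'))        # keep edges only
        alpha = torch.softmax(e, dim=-1)                  # attention weights
        return F.elu(alpha @ z)                           # aggregate neighbors
```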
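Finally, the temporal transfer of fused features across frames can be sketched with a standard ConvLSTM cell (Shi et al., 2015), which the abstract names explicitly; channel sizes, frame count, and the downstream regression heads for steering angle and speed below are placeholders, since the paper's exact architecture is not given here.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: an LSTM whose gates are convolutions, so the
    spatial layout of the fused RGB/semantic/depth features is preserved."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                   # hidden and cell states
        gates = self.conv(torch.cat([x, h], dim=1))    # all four gates at once
        i, f, o, g = gates.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                  # update cell state
        h = o * torch.tanh(c)                          # new hidden state
        return h, (h, c)

# Usage sketch: roll the cell over consecutive fused frames.
cell = ConvLSTMCell(in_ch=64, hid_ch=32)
h0 = torch.zeros(1, 32, 28, 28)
state = (h0, torch.zeros_like(h0))
for t in range(4):                        # 4 frames, purely illustrative
    fused = torch.rand(1, 64, 28, 28)     # stand-in for fused multimodal features
    h, state = cell(fused, state)
# h would then feed regression heads for steering wheel angle and speed.
```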
Authors
GUO Ying-shi; HUANG Tao (School of Automobile, Chang'an University, Xi'an 710064, Shaanxi, China)
Source
China Journal of Highway and Transport (《中国公路学报》), 2022, Issue 9, pp. 141-156 (16 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journal list.
Funding
National Key Research and Development Program of China (2019YFB1600500)
Graduate Scientific Research Practice and Innovation Project of Chang'an University (300103722003)
Keywords
automotive engineering
autonomous driving model
attention mechanism
driving behavior
multimodal