Parking accurately and safely in highly constrained spaces remains a critical challenge. Unlike driving in structured environments, parking requires executing complex maneuvers such as frequent gear shifts and steering saturation. Recent attempts to employ imitation learning (IL) for parking have achieved promising results. However, existing works ignore the multimodal nature of parking behavior in lane-free open space, failing to derive multiple plausible solutions for the same situation. Moreover, IL-based methods suffer from inherent causal confusion, making it particularly difficult for a neural network to generalize across diverse parking scenarios. To address these challenges, we propose MultiPark, an autoregressive transformer for multimodal parking. To handle paths filled with abrupt turning points, we introduce a data-efficient next-segment prediction paradigm that enables spatial generalization and temporal extrapolation. Furthermore, we design learnable parking queries factorized into gear, longitudinal, and lateral components, which decode diverse parking behaviors in parallel. To mitigate causal confusion in IL, our method employs target-centric pose and ego-centric collision losses as outcome-oriented supervision across all modalities, beyond pure imitation loss. Evaluations on real-world datasets demonstrate that MultiPark achieves state-of-the-art performance across various scenarios. We deploy MultiPark on a production vehicle, further confirming our approach's robustness in real-world parking environments.
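The factorized parking queries can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding dimension, factor counts, and the additive composition of gear, longitudinal, and lateral components are illustrative assumptions. It only shows how a small number of learned factor embeddings yields a combinatorially larger set of mode queries decoded in parallel.

```python
# Hypothetical sketch: compose G x Lon x Lat parking-mode queries from
# G + Lon + Lat learned factor embeddings (additive composition assumed).
import itertools
import random

random.seed(0)
DIM = 8                      # embedding dimension (illustrative)
GEARS, LON, LAT = 2, 3, 3    # e.g. {forward, reverse} x 3 x 3 = 18 modes

def embed(n):
    """Stand-in for learnable embeddings: n random DIM-d vectors."""
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]

gear_q, lon_q, lat_q = embed(GEARS), embed(LON), embed(LAT)

def compose_queries():
    """Sum one gear, one longitudinal, and one lateral component
    per mode, producing one query vector per parking behavior."""
    queries = []
    for g, lo, la in itertools.product(gear_q, lon_q, lat_q):
        queries.append([a + b + c for a, b, c in zip(g, lo, la)])
    return queries

modes = compose_queries()
print(len(modes), len(modes[0]))  # 18 mode queries, each DIM-dimensional
```

Only 8 factor vectors are learned here, yet 18 distinct mode queries result, which is the data-efficiency argument behind factorization.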
In (a), the parking path involves both forward and reverse gears, resulting in discontinuous segments with sharp turning points. In (b), under the same scenario, different drivers may produce distinct parking solutions. In (c), the blue path accumulates errors and causes a collision, whereas the green path can recover from mistakes made in the first segment.
A BEV encoder first takes the sensor inputs and extracts scene features, which serve as keys and values for cross-attention. Next, a query-based decoder explores both forward and reverse gears; each segment's endpoint serves as the standpoint for the next segment, so the decoder rolls out autoregressively and produces multimodal parking paths. Finally, we select the optimal path and use the predicted waypoints to control the vehicle.
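The autoregressive rollout can be sketched as follows. The decoder here is a trivial stub (it just rolls the pose straight along the current heading), standing in for the learned transformer; the segment length, step size, and gear plan are illustrative assumptions. The point is the control flow: each predicted segment's endpoint pose becomes the standpoint for the next prediction.

```python
# Minimal sketch (not the paper's network) of next-segment autoregressive
# rollout: a stub decoder predicts one segment at a time, and the segment's
# endpoint pose seeds the next decoder call.
import math

def stub_decode_segment(pose, gear, n_pts=5, step=0.5):
    """Placeholder for the learned decoder: advance n_pts waypoints
    along the current heading. gear = +1 (forward) or -1 (reverse)."""
    x, y, yaw = pose
    seg = []
    for _ in range(n_pts):
        x += gear * step * math.cos(yaw)
        y += gear * step * math.sin(yaw)
        seg.append((x, y, yaw))
    return seg

def rollout(start_pose, gear_plan):
    """One decoder call per segment; the endpoint of each segment
    becomes the standpoint of the next (autoregressive rollout)."""
    path, pose = [], start_pose
    for gear in gear_plan:
        seg = stub_decode_segment(pose, gear)
        path.append(seg)
        pose = seg[-1]  # endpoint -> next standpoint
    return path

# Two-segment maneuver: drive forward, then reverse.
segments = rollout((0.0, 0.0, 0.0), gear_plan=[+1, -1])
print(len(segments), segments[-1][-1])  # 2 segments; ends back at x = 0.0
```

A multimodal planner would run one such rollout per mode query and then score the candidates, e.g. with the outcome-oriented pose and collision terms, to pick the executed path.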
We compare MultiPark to six competitive baselines. Note that all baselines are trained on the same dataset until convergence and use the same encoder network as ours for a fair comparison. We report the test-set results in the table. Notably, MultiPark achieves SOTA closed-loop performance on real-world datasets and consistently outperforms the baselines on all metrics, underscoring its generalization across various scenarios.
@article{zheng2025multipark,
title={MultiPark: Multimodal Parking Transformer with Next-Segment Prediction},
author={Zheng, Han and Zhou, Zikang and Zhang, Guli and Wang, Zhepei and Wang, Kaixuan and Li, Peiliang and Shen, Shaojie and Yang, Ming and Qin, Tong},
journal={arXiv preprint arXiv:2508.11537},
year={2025}
}