Yifan Wang1,4, Yian Zhao2, Fanqi Pu1, Xiaochen Yang3, Yang Tang4,†, Xi Chen4, Wenming Yang1,†
1Tsinghua University 2Peking University 3University of Glasgow 4Tencent BAC
†Corresponding Authors
📧 yf-wang23@mails.tsinghua.edu.cn yang.wenming@sz.tsinghua.edu.cn
This repository hosts the official implementation of SPAN (Spatial-Projection Alignment), a novel framework for monocular 3D object detection that addresses the geometric consistency constraints overlooked in existing decoupled regression paradigms.
SPAN introduces a unified geometric consistency optimization paradigm that comprises two pivotal components:
- Spatial Point Alignment: Enforces an explicit global spatial constraint between predicted and ground-truth 3D bounding boxes by aligning their eight corner coordinates in the camera coordinate system, thereby rectifying the spatial drift caused by decoupled attribute regression.
- 3D-2D Projection Alignment: Ensures that the projected 3D box fits tightly within its corresponding 2D detection box on the image plane, mitigating the projection misalignment overlooked in previous works.
To ensure training stability, we further introduce a Hierarchical Task Learning (HTL) strategy that progressively incorporates spatial-projection alignment as the 3D attribute predictions refine, preventing early-stage error propagation across attributes.
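As an illustrative sketch only (not the official implementation), the two alignment terms can be written as PyTorch-style losses. The function names, corner ordering, and simple pinhole projection with intrinsics `K` below are assumptions for exposition:

```python
import torch

def box_corners(center, dims, yaw):
    """Eight corners of a 3D box in camera coordinates.
    center: (B, 3), dims: (B, 3) as (h, w, l), yaw: (B,) rotation about the y-axis."""
    h, w, l = dims[:, 0], dims[:, 1], dims[:, 2]
    x = torch.stack([l / 2, l / 2, -l / 2, -l / 2, l / 2, l / 2, -l / 2, -l / 2], dim=1)
    y = torch.stack([torch.zeros_like(h)] * 4 + [-h] * 4, dim=1)  # bottom then top face
    z = torch.stack([w / 2, -w / 2, -w / 2, w / 2, w / 2, -w / 2, -w / 2, w / 2], dim=1)
    cos, sin = yaw.cos().unsqueeze(1), yaw.sin().unsqueeze(1)
    xr = cos * x + sin * z        # rotate corners around the camera y-axis
    zr = -sin * x + cos * z
    pts = torch.stack([xr, y, zr], dim=2)          # (B, 8, 3)
    return pts + center.unsqueeze(1)

def spatial_point_loss(pred, gt):
    """Spatial Point Alignment: L1 distance between the eight predicted
    and ground-truth corners in camera coordinates."""
    return (box_corners(*pred) - box_corners(*gt)).abs().mean()

def projection_alignment_loss(pred, K, box2d):
    """3D-2D Projection Alignment: penalize the gap between the image-plane
    envelope of the projected 3D corners and the 2D detection box.
    K: (3, 3) camera intrinsics, box2d: (B, 4) as (x1, y1, x2, y2)."""
    pts = box_corners(*pred)                        # (B, 8, 3)
    uv = pts @ K.T                                  # pinhole projection
    uv = uv[..., :2] / uv[..., 2:3].clamp(min=1e-6)
    proj = torch.cat([uv.min(dim=1).values, uv.max(dim=1).values], dim=1)
    return (proj - box2d).abs().mean()
```

Both terms operate purely on predicted attributes, so they add loss terms at training time without touching the inference path.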
- 🎯 Spatial Point Alignment: Constrains 3D bounding box corners to align with ground-truth corners
- 📐 3D-2D Projection Alignment: Ensures projected 3D boxes match their 2D detection boxes
- 📈 Hierarchical Task Learning: Progressive training strategy for stable optimization
- 🔌 Plug-and-Play: Can be easily integrated into any monocular 3D detector
- ⚡ Zero Inference Cost: No additional modules or computational overhead at inference time
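A minimal sketch of how such a progressive HTL schedule could look when plugging the alignment losses into an existing detector. The warm-up fraction, linear ramp, and function names are illustrative assumptions, not the exact schedule from the paper:

```python
def htl_weight(epoch, total_epochs, warmup_frac=0.4):
    """Weight for the spatial-projection alignment terms: zero during early
    epochs, then a linear ramp to 1, so alignment is only enforced once the
    decoupled 3D attribute heads are reasonably accurate."""
    warmup = warmup_frac * total_epochs
    if epoch < warmup:
        return 0.0
    return min(1.0, (epoch - warmup) / max(total_epochs - warmup, 1e-6))

def total_loss(base_losses, align_losses, epoch, total_epochs):
    """Combine the detector's own losses with the weighted alignment losses."""
    w = htl_weight(epoch, total_epochs)
    return sum(base_losses) + w * sum(align_losses)
```

Because the schedule only rescales extra loss terms, the host detector's architecture and inference path are left untouched.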
- Clone the repository:

  ```bash
  git clone https://github.com/WYFDUT/SPAN.git
  cd SPAN
  conda create -n span python=3.8
  conda activate span
  ```

- Install dependencies:

  ```bash
  pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
  pip install -r requirements.txt
  cd lib/models/monodgp/ops/
  bash make.sh
  cd ../../../..
  ```

- Install OpenPCDet (if needed):

  ```bash
  cd OpenPCDet
  python setup.py develop
  cd ..
  ```
Download KITTI datasets and prepare the directory structure as:
```
│SPAN/
├──...
│data/kitti/
├──ImageSets/
├──training/
│   ├──image_2
│   ├──label_2
│   ├──calib
├──testing/
│   ├──image_2
│   ├──calib
```

Update the dataset path in `configs/span.yaml`:
```yaml
dataset:
  root_dir: '/path/to/KITTI'
```

Basic usage:
```bash
bash train.sh configs/span.yaml
```

With a custom GPU:
```bash
CUDA_VISIBLE_DEVICES=0 bash train.sh configs/span.yaml
```

Checkpoints are saved to the path specified in `trainer.save_path`.
The best checkpoint is evaluated by default; you can change this via `tester/checkpoint` in `configs/span.yaml`:

```bash
bash test.sh configs/span.yaml
```

The official results in the paper (KITTI Val split, AP3D|R40):

| Models | Easy | Mod. | Hard |
|:------:|:----:|:----:|:----:|
| MonoDGP + (SPAN) | 30.98% | 23.26% | 20.17% |
This repo's results on the KITTI Val split (AP3D|R40):

| Models | Easy | Mod. | Hard | Logs | ckpt |
|:------:|:----:|:----:|:----:|:----:|:----:|
| MonoDGP + (SPAN) | 31.92% | 23.32% | 20.00% | log | ckpt |
| | 30.94% | 23.34% | 20.21% | log | - |
| | 31.81% | 23.44% | 20.29% | log | ckpt |
The official results in the paper on the KITTI Test split (AP3D|R40):

| Models | Easy | Mod. | Hard | ckpt |
|:------:|:----:|:----:|:----:|:----:|
| MonoDGP + (SPAN) | 27.02% | 19.30% | 16.49% | - |
Test results submitted to the official KITTI Benchmark:
Car category:
All categories:
If you use this code in your research, please cite:
```bibtex
@misc{wang2025spanspatialprojectionalignmentmonocular,
      title={SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection},
      author={Yifan Wang and Yian Zhao and Fanqi Pu and Xiaochen Yang and Yang Tang and Xi Chen and Wenming Yang},
      year={2025},
      eprint={2511.06702},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.06702},
}
```

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
This repo benefits from the excellent works MonoDGP, OpenPCDet, and MGIoU, as well as related monocular 3D detection frameworks.