Semantic 4D occupancy forecasting is vital for safe autonomous driving because it allows vehicles to anticipate how scene geometry and dynamics will evolve. However, state-of-the-art models are mostly trained with full supervision, which requires large volumes of dense 3D voxel annotations that are prohibitively expensive to produce.
To overcome this data bottleneck, current research is shifting towards self-supervised and weakly supervised paradigms that leverage pre-trained 2D foundation models (e.g., DINOv2, CLIP, or SAM). By aligning the rich, open-vocabulary 2D semantic features of these models with 3D/4D spatial representations using Transformer architectures, such approaches aim for robust spatio-temporal understanding without dense 3D ground truth.
This Master’s thesis builds on these advances to develop a foundation-model-aligned framework for vision-based 4D occupancy forecasting. You will design an architecture that distills rich multi-view semantics into a 4D forecasting pipeline, bridging the gap between scalable camera-only inputs and high-fidelity environment prediction.
For outstanding results, we actively encourage and support submissions to top-tier conferences.
Daniela
+49 821 885882-0
work@xitaso.com