Master Thesis (all genders) – Semantic 4D Occupancy Forecasting

Internship/work-study programme, Full-time or part-time | Karlsruhe, Krumbach, Berlin, Ingolstadt, Augsburg, Erlangen, Leipzig, Münster, München

Abstract

Semantic 4D occupancy forecasting is vital for safe autonomous driving, allowing vehicles to anticipate future scene dynamics and geometry. However, training state-of-the-art models relies heavily on fully supervised methods that require massive, prohibitively expensive dense 3D voxel annotations.

To overcome this data bottleneck, current research is shifting towards self-supervised and weakly supervised paradigms that leverage pre-trained 2D foundation models (e.g., DINOv2, CLIP, or SAM). By aligning these rich, open-vocabulary 2D semantic features with 3D/4D spatial representations using Transformer architectures, it is possible to achieve robust spatial-temporal understanding without dense 3D ground truth.
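To make the alignment idea concrete, here is a minimal, framework-free sketch of one common choice: a cosine-similarity distillation loss between voxel features projected into the image and the features of a frozen 2D foundation model at the matching pixels. This is an illustration under assumptions, not the method prescribed for the thesis. The function names `cosine_similarity` and `alignment_loss` and the plain-list feature format are hypothetical; a real pipeline would use batched PyTorch tensors.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def alignment_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over matched feature pairs.

    student_feats: features of 3D/4D voxels projected into the image
                   (hypothetical output of the model being trained).
    teacher_feats: features of a frozen 2D foundation model at the
                   pixels those voxels project to (assumed precomputed).
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must have the same number of features")
    losses = [1.0 - cosine_similarity(s, t)
              for s, t in zip(student_feats, teacher_feats)]
    return sum(losses) / len(losses)


# Features pointing in the same direction give a loss near 0,
# orthogonal features give a loss near 1.
aligned = alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
orthogonal = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

The loss depends only on feature direction, not magnitude, which is one reason it is often used when student and teacher features live on different scales.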

Building upon these breakthroughs, this Master’s thesis focuses on developing a foundation-model-aligned framework for vision-based 4D occupancy forecasting. You will design an architecture that distills rich multi-view semantics into a 4D forecasting pipeline, bridging the gap between scalable camera-only inputs and high-fidelity environment prediction.

For outstanding results, we actively encourage and support submissions to top-tier conferences.

Tasks that interest you

Develop a Transformer-based network for predicting future semantic 4D occupancy from sequential multi-view camera inputs using weak or self-supervision.
Build and train the PyTorch pipeline, designing alignment mechanisms to distill semantic features from 2D foundation models into your 4D spatial-temporal representation.
Benchmark against fully supervised baselines on large-scale datasets (e.g., nuScenes, OpenOccupancy), focusing on forecasting accuracy (IoU), semantic precision, and label efficiency.
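
For orientation, the IoU metrics mentioned in the benchmarking task can be sketched as follows. This is a simplified illustration that assumes a voxel grid flattened into a list of integer class labels, with label 0 meaning free space; the function names and the label convention are assumptions, and each benchmark defines its own evaluation protocol (for example, which voxels are ignored and how future timesteps are aggregated).

```python
def occupancy_iou(pred, target, free_label=0):
    """Geometric IoU: occupied vs. free, ignoring semantic classes."""
    pairs = list(zip(pred, target))
    inter = sum(1 for p, t in pairs if p != free_label and t != free_label)
    union = sum(1 for p, t in pairs if p != free_label or t != free_label)
    return inter / union if union else float("nan")


def mean_semantic_iou(pred, target, class_ids):
    """Mean IoU over the given semantic classes.

    Classes that appear in neither prediction nor ground truth are skipped.
    """
    pairs = list(zip(pred, target))
    ious = []
    for c in class_ids:
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy grid of six voxels for one forecast timestep (0 = free, 1 and 2 = classes).
pred = [0, 1, 1, 2, 0, 2]
target = [0, 1, 2, 2, 1, 0]
geometric = occupancy_iou(pred, target)             # 3 / 5 = 0.6
semantic = mean_semantic_iou(pred, target, [1, 2])  # (1/3 + 1/3) / 2
```

For forecasting, such a score would typically be computed separately for each predicted future frame and then reported per horizon or averaged.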

What makes you stand out

You are registered in a master’s program in computer science, artificial intelligence, robotics, or a related field.
You have excellent programming skills in Python as well as solid experience with deep learning frameworks (especially PyTorch).
You have a solid background in 3D computer vision. Practical experience with semantic segmentation, occupancy networks, or 3D Gaussian splatting is a major plus.
You have knowledge of Vision Transformers (ViT), foundation models (DINO, CLIP), and self-supervised and weakly supervised learning paradigms.
You work independently and are solution-oriented and highly motivated. You have very good German and English skills (at least C1 level), ensuring clear and confident communication within the team and with our partners.

Your contact person

Daniela
+49 821 885882-0
work@xitaso.com
