Research Project

Controllable Surround-View Driving Generation

A controllable multi-view world model for driving: 3D layout, map, and multi-granularity control signals are injected into a diffusion process to generate geometrically consistent 4V / 7V / 11V images and video for data augmentation and open-loop simulation.

Timeline: 2023.05–2024
Context: PhiGent Robotics
Role: Generative Driving Algorithm Engineer
Stage: Pre-research

Overview

What is this project about?

Built a controllable surround-view driving generator. It compresses 3D boxes and maps into spatial conditions, encodes text, reference frames, lanes, and camera calibration into condition tokens, and injects both into a UNet diffusion backbone. The output is cross-camera-consistent 4V / 7V / 11V images and video for data augmentation and open-loop simulation. The model evolved from OpenSora 1.0 + SD 3.5 to a MagicDrive-fused in-house model.
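One way to picture the "spatial condition" step is to project 3D box corners into each camera using that camera's calibration. The sketch below is a minimal pinhole projection in plain Python. The function names, the camera-frame convention (x right, y down, z forward), the axis-aligned box, and the example numbers are all assumptions for illustration, not the project's actual code.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]


def box_corners(center: Vec3, size: Vec3) -> List[Vec3]:
    """Eight corners of an axis-aligned box (yaw omitted to keep the sketch short)."""
    cx, cy, cz = center
    dx, dy, dz = (s / 2.0 for s in size)
    return [
        (cx + sx * dx, cy + sy * dy, cz + sz * dz)
        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)
    ]


def project_point(p_world: Vec3, rot: Mat3, trans: Vec3,
                  fx: float, fy: float, cx: float, cy: float
                  ) -> Optional[Tuple[float, float]]:
    """Project a world point to pixel coordinates; None if it lies behind the camera.

    rot and trans map world coordinates into the camera frame:
    p_cam = rot @ p_world + trans.
    """
    x, y, z = (
        sum(rot[i][j] * p_world[j] for j in range(3)) + trans[i]
        for i in range(3)
    )
    if z <= 1e-6:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def project_box(center: Vec3, size: Vec3, rot: Mat3, trans: Vec3,
                fx: float, fy: float, cx: float, cy: float
                ) -> Optional[Tuple[float, float, float, float]]:
    """2D bounding rectangle of the visible corners, or None if no corner is visible."""
    pts = [project_point(c, rot, trans, fx, fy, cx, cy)
           for c in box_corners(center, size)]
    pts = [p for p in pts if p is not None]
    if not pts:
        return None
    us, vs = zip(*pts)
    return (min(us), min(vs), max(us), max(vs))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A 2 m x 2 m x 4 m box centred 10 m in front of a camera at the world origin.
rect = project_box((0.0, 0.0, 10.0), (2.0, 2.0, 4.0), IDENTITY, (0.0, 0.0, 0.0),
                   fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
```

With these example values, `rect` comes out as roughly `(835.0, 415.0, 1085.0, 665.0)`. Repeating the call with each camera's own `rot`, `trans`, and intrinsics gives one layout per view, which is the sense in which the same 3D scene can condition 4, 7, or 11 cameras.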

research | world-model | generative | e2e
4V / 7V / 11V surround | 3D layout + HD map | Text · refs · lanes · rig tokens | UNet diffusion backbone | OpenSora → MagicDrive fusion
4V · 7V · 11V: Camera configurations supported
6+: Control signals per generation
2 uses: Augmentation · open-loop simulation
V2: RGB, depth, ego-pose control

Logic map

Noise to controllable surround worlds

The logic map traces how structure, tokens, and denoising create geometry-aligned 4V / 7V / 11V output.

Conditioned diffusion pipeline

7V controllable image generation pipeline
3D boxes and maps become spatial conditions; text, reference frames, lanes, and camera calibration become tokens. The UNet denoises from pure noise into aligned multi-view latents, then decodes pixels.
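The "denoise from pure noise" step in the caption above can be sketched as a reverse-diffusion loop. The code below is a toy deterministic DDIM-style sampler in plain Python, not the project's sampler. The noise predictor `toy_eps_model` is a hypothetical stand-in for the conditioned UNet: it simply treats the condition vector as the clean latent, so the loop structure can be checked end to end. Each view is denoised independently here, whereas a real multi-view model would couple the views, and the final decode to pixels is omitted.

```python
import math
import random
from typing import Callable, List

Latent = List[float]


def alpha_bar_schedule(num_steps: int, beta_start: float = 1e-4,
                       beta_end: float = 0.2) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    Index 0 is 1.0 (no noise); index num_steps is close to 0 (almost pure noise).
    """
    alpha_bars = [1.0]
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def toy_eps_model(x_t: Latent, alpha_bar: float, condition: Latent) -> Latent:
    """Hypothetical stand-in for the conditioned UNet.

    Returns the noise that would explain x_t if the clean latent equalled `condition`.
    """
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [(x - scale * c) / sigma for x, c in zip(x_t, condition)]


def ddim_sample(conditions: List[Latent],
                eps_model: Callable[[Latent, float, Latent], Latent],
                num_steps: int = 50, seed: int = 0) -> List[Latent]:
    """Deterministic DDIM-style sampling, one latent vector per camera view."""
    rng = random.Random(seed)
    alpha_bars = alpha_bar_schedule(num_steps)
    # Every view starts from Gaussian noise of the same shape as its condition.
    latents = [[rng.gauss(0.0, 1.0) for _ in cond] for cond in conditions]
    for t in range(num_steps, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        for v, cond in enumerate(conditions):
            eps = eps_model(latents[v], ab_t, cond)
            # Estimate the clean latent, then step to the previous noise level.
            x0 = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                  for x, e in zip(latents[v], eps)]
            latents[v] = [math.sqrt(ab_prev) * x0_i + math.sqrt(1.0 - ab_prev) * e
                          for x0_i, e in zip(x0, eps)]
    return latents


# Seven views, each with a 4-element toy condition vector.
conditions = [[float(v)] * 4 for v in range(7)]
views = ddim_sample(conditions, toy_eps_model)
```

Because the stand-in predictor is exact, each returned latent matches its condition up to floating-point error. With a learned predictor the same loop would instead land on a sample that is only steered by the conditions.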

Scene replacement for augmentation

7V map-conditioned scene replacement
7V scene/style replacement: geometry held, appearance varied.
11V map-conditioned scene replacement
11V replacement uses the same map-conditioned consistency logic.
120 degree FOV single-view scene replacement
1V 120° FOV variant: controlled single-view regeneration.
Generated traffic cones and lane lines
Long-tail cone and lane-line synthesis without field collection.
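The "geometry held, appearance varied" idea behind scene replacement can be sketched as a request object whose structural fields stay fixed while only the text prompt changes. The class `SceneConditions`, its fields, and the helper `appearance_variants` are hypothetical names chosen for this illustration; the real condition format is not described in this page.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class SceneConditions:
    """Hypothetical bundle of control signals for one generation request."""
    boxes: Tuple[Tuple[float, ...], ...]  # 3D boxes, e.g. (x, y, z, l, w, h, yaw)
    map_id: str                           # handle to the map layout
    num_views: int                        # 4, 7, or 11
    prompt: str                           # text describing appearance


def appearance_variants(base: SceneConditions,
                        prompts: List[str]) -> List[SceneConditions]:
    """Copy the base request once per prompt, leaving every geometric field untouched."""
    return [replace(base, prompt=p) for p in prompts]


base = SceneConditions(
    boxes=((5.0, 0.0, 0.0, 4.5, 1.9, 1.6, 0.0),),
    map_id="example-map",
    num_views=7,
    prompt="sunny afternoon",
)
variants = appearance_variants(base, ["rainy night", "overcast morning", "snow"])
```

Each variant would then be passed to the generator separately, yielding several appearances of the same labelled layout.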

Surround video generation

4V fisheye daylight rollout, generated as a temporally coherent clip.
Controllable 7V surround video after in-house driving pretraining.

V2: depth and ego control

V2 adds per-pixel depth output as a second generated modality.
Ego-trajectory controllability improved for pose-guided rollouts.
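A common way to turn an ego trajectory into a per-frame control signal is to express each pose relative to the previous one. The sketch below does this for planar poses (x, y, yaw) in plain Python. Whether the project encodes ego motion this way is an assumption; the function name `relative_poses` and the planar simplification are illustrative only.

```python
import math
from typing import List, Tuple

Pose2D = Tuple[float, float, float]  # (x, y, yaw in radians) in a global frame


def wrap_angle(angle: float) -> float:
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def relative_poses(trajectory: List[Pose2D]) -> List[Pose2D]:
    """Express each pose in the frame of the pose before it.

    Returns one (forward, left, delta_yaw) tuple per consecutive pair,
    assuming x points forward and y points left at yaw = 0.
    """
    steps = []
    for (x0, y0, yaw0), (x1, y1, yaw1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        cos0, sin0 = math.cos(yaw0), math.sin(yaw0)
        # Rotate the global displacement into the previous pose's frame.
        forward = cos0 * dx + sin0 * dy
        left = -sin0 * dx + cos0 * dy
        steps.append((forward, left, wrap_angle(yaw1 - yaw0)))
    return steps


# Drive 1 m forward, then move 1 m to the left while turning 90 degrees.
trajectory = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, math.pi / 2)]
steps = relative_poses(trajectory)
```

Relative steps are independent of where the trajectory starts in the global frame, which is one reason they are a convenient conditioning input for rollouts.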
Unbalanced real data → Generate rare scenes on demand
Cross-camera drift → Project boxes and maps per view
Style-only control → Fuse layout, map, text, rig tokens
RGB-only worlds → Extend to depth and ego-pose control
OmniNWM. After my departure, former colleagues led the follow-up OmniNWM direction: github.com/Ma-Zhuang/OmniNWM.
My role. Built the controllable driving generation pipeline, described here at a high level: structured conditions, diffusion integration, and sanitized visualization for augmentation and open-loop simulation.
Confidentiality note. PhiGent Robotics research. Only sanitized generation results and high-level pipeline descriptions are shown; dataset details and internal evaluation metrics are omitted. The OmniNWM follow-up was led by former colleagues after my departure and is credited above.