Research Project

Controllable Surround-View Driving Generation

A controllable multi-view world model for driving: 3D layout, map, and multi-granularity control signals are injected into a diffusion process to generate geometrically consistent 4V / 7V / 11V images and video for data augmentation and open-loop simulation.

Timeline: 2023.05–2024
Context: PhiGent Robotics
Role: Generative Driving Algorithm Engineer
Stage: Pre-research

Overview

What is this project about?

Built a controllable surround-view driving generator. It compresses 3D boxes and maps into spatial conditions, encodes text, reference frames, lanes, and camera calibration into condition tokens, and injects them into a UNet diffusion backbone. The output is cross-camera-consistent 4V / 7V / 11V images and video for data augmentation and open-loop simulation. The stack evolved from OpenSora 1.0 + SD 3.5 to an in-house model fused with MagicDrive.

research | world-model | generative | e2e

4V / 7V / 11V surround | 3D layout + HD map | Text · refs · lanes · rig tokens | UNet diffusion backbone | OpenSora → MagicDrive fusion

4V · 7V · 11V: Camera configurations supported

6+: Control signals per generation

2 uses: Augmentation · open-loop simulation

V2: RGB, depth, ego-pose control

Logic map

Noise to controllable surround worlds

The logic map traces how structure, tokens, and denoising create geometry-aligned 4V / 7V / 11V output.

Conditioned diffusion pipeline

7V controllable image generation pipeline

3D boxes and maps become spatial conditions; text, reference frames, lanes, and camera calibration become tokens. The UNet denoises from pure noise into aligned multi-view latents, then decodes pixels.
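The denoising step described above can be illustrated with a toy sampling loop. This is a minimal sketch, not the project's implementation: the sampler here is deterministic DDIM (an assumption, since the source does not name the sampler), `make_oracle_denoiser` is a hypothetical stand-in for the conditioned UNet, and the `(views, channels, height, width)` latent shape is illustrative only.

```python
import numpy as np

def ddim_sample(denoiser, cond, shape, alphas_cumprod, rng):
    """Deterministic DDIM sampling from pure noise.

    denoiser(x, t, cond) must return the predicted noise for latents x.
    alphas_cumprod[t] is the cumulative signal fraction, decreasing in t.
    """
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        eps = denoiser(x, t, cond)
        # Estimate the clean latents implied by the current noise prediction.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Step to the previous (less noisy) timestep without adding new noise.
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

def make_oracle_denoiser(alphas_cumprod):
    """Hypothetical stand-in for a trained UNet: it reads the clean latents
    from cond["target"] and returns the exact noise, so the loop is testable."""
    def denoiser(x, t, cond):
        a_t = alphas_cumprod[t]
        return (x - np.sqrt(a_t) * cond["target"]) / np.sqrt(1.0 - a_t)
    return denoiser

rng = np.random.default_rng(0)
alphas = np.linspace(0.999, 0.01, 50)        # illustrative noise schedule
target = rng.standard_normal((7, 4, 8, 8))   # 7 views of small toy latents
latents = ddim_sample(make_oracle_denoiser(alphas), {"target": target},
                      target.shape, alphas, rng)
```

In a real system the oracle would be replaced by the trained network, with the spatial conditions and condition tokens passed through `cond`, and a decoder would map the resulting latents to pixels.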

Scene replacement for augmentation

7V scene/style replacement: geometry held, appearance varied.

11V replacement uses the same map-conditioned consistency logic.

120° FOV single-view scene replacement

1V 120° FOV variant: controlled single-view regeneration.

Long-tail cone and lane-line synthesis without field collection.

Surround video generation

4V fisheye daylight rollout, generated as a temporally coherent clip.

Controllable 7V surround video after in-house driving pretraining.

V2: depth and ego control

V2 adds per-pixel depth output as a second generated modality.

Ego-trajectory controllability improved for pose-guided rollouts.
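One common way to feed an ego trajectory to a generator is as per-frame motion relative to the previous ego frame, which keeps the signal independent of where the clip sits in the world. The sketch below shows that conversion for planar poses. It assumes `(x, y, yaw)` poses in a world frame with x forward and y left; the function names are hypothetical, and the source does not state how the actual model encodes ego pose.

```python
import numpy as np

def se2_matrix(x, y, yaw):
    """Homogeneous 3x3 transform for a planar pose (x, y, heading)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])

def relative_ego_poses(trajectory):
    """Convert world-frame (x, y, yaw) poses into per-step motion expressed
    in the previous ego frame. Returns an array of shape (T - 1, 3) holding
    (dx, dy, dyaw) for each consecutive pair of poses."""
    poses = [se2_matrix(*p) for p in trajectory]
    deltas = []
    for prev, curr in zip(poses[:-1], poses[1:]):
        rel = np.linalg.inv(prev) @ curr           # curr pose seen from prev
        dyaw = np.arctan2(rel[1, 0], rel[0, 0])    # wrapped to (-pi, pi]
        deltas.append((rel[0, 2], rel[1, 2], dyaw))
    return np.array(deltas)

# A vehicle heading along world +y that advances 2 m moves 2 m "forward"
# in its own frame, regardless of its world heading.
steps = relative_ego_poses([(0.0, 0.0, np.pi / 2), (0.0, 2.0, np.pi / 2)])
```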

Unbalanced real data → Generate rare scenes on demand

Cross-camera drift → Project boxes and maps per view

Style-only control → Fuse layout, map, text, rig tokens

RGB-only worlds → Extend to depth and ego-pose control
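Projecting boxes per view, as named in the pairs above, means each camera receives the same 3D layout rendered through its own calibration. A minimal sketch follows, assuming a pinhole camera model, an ego frame with x forward, y left, z up, and a camera frame with z forward, x right, y down. The helper names and the example calibration are hypothetical, and fisheye cameras would need a different projection model.

```python
import numpy as np

def box_corners(center, size, yaw):
    """Return the 8 corners, shape (8, 3), of a yaw-rotated box in the ego frame."""
    length, width, height = size
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)], dtype=float)
    offsets = signs * (np.array([length, width, height]) / 2.0)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return offsets @ rot.T + np.asarray(center, dtype=float)

def project_to_view(points_ego, cam_from_ego, K):
    """Project (N, 3) ego-frame points to pixel coordinates with a pinhole model.

    cam_from_ego is a 4x4 extrinsic matrix and K a 3x3 intrinsic matrix.
    Returns (N, 2) pixels plus a boolean mask of points in front of the camera;
    pixel values for masked-out points are not meaningful.
    """
    homo = np.hstack([points_ego, np.ones((len(points_ego), 1))])
    pts_cam = (cam_from_ego @ homo.T).T[:, :3]
    in_front = pts_cam[:, 2] > 1e-6
    depth = np.where(in_front, pts_cam[:, 2], 1.0)  # avoid dividing by <= 0
    uv = (K @ pts_cam.T).T[:, :2] / depth[:, None]
    return uv, in_front

# Hypothetical front camera at the ego origin, looking along ego +x.
K = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
cam_from_ego = np.eye(4)
cam_from_ego[:3, :3] = [[0.0, -1.0, 0.0], [0.0, 0.0, -1.0], [1.0, 0.0, 0.0]]

corners = box_corners(center=(10.0, 0.0, 0.0), size=(4.0, 2.0, 1.5), yaw=0.0)
uv, visible = project_to_view(corners, cam_from_ego, K)
```

Running the same projection with each camera's own `cam_from_ego` and `K` yields one box rendering per view, all derived from a single 3D layout, which is what keeps objects aligned across cameras.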

OmniNWM. After my departure, former colleagues led the follow-up OmniNWM direction: github.com/Ma-Zhuang/OmniNWM.

My role. Built the controllable driving generation pipeline, described here at a high level: structured conditions, diffusion integration, and sanitized visualization for augmentation and open-loop simulation.

Confidentiality note. PhiGent Robotics research. Only sanitized generation results and high-level pipeline descriptions are shown; dataset details and internal evaluation metrics are omitted. The OmniNWM follow-up was led by former colleagues after my departure and is credited above.