Research Platform

Vector Traffic Generation & Sensor-Level Closed-Loop Simulation

Two halves of a controllable driving simulator: a structure-aware temporal vector world model that compresses and generates traffic as latents, and a sensor-level closed-loop pipeline that reconstructs, populates, and re-renders photorealistic surround video.

Timeline: 2025.05–Present
Context: Bosch (XC-CN)
Role: World Models Algorithm Engineer
Stage: Ongoing

Overview

What is this project about?

Built a two-level controllable driving simulator from three parts. STAR-AE, a structure-aware temporal vector VAE, compresses a sparse, variable number of agents and lanes into a fixed-size latent. STRIDENet, a conditional latent-diffusion generator, produces future traffic consistent with the observed history. WorldSim, a sensor-level closed loop, fuses Gaussian-Splatting reconstruction, traffic-flow generation, and a mask-guided DiT video editor (built on MagicDrive-V2) into photorealistic surround rollouts.

research · world-model · generative · e2e
Structure-aware temporal VAE · Conditional latent diffusion · Gaussian-Splatting reconstruction · Mask-guided DiT editor · MagicDrive-V2 base
2 levels: Vector control · sensor control
Fixed z: Variable agents/lanes → one latent
4 masks: Keep · context · edge · generate
Closed loop: Reconstruct → populate → re-render

Logic map

Two control levels — vectors decide what happens, sensors decide what cameras see

The left lane generates traffic as latents; the right lane reconstructs a photoreal background; both merge into a mask-guided video editor.

Temporal vector AE in motion

The VAE encodes sparse, variable scenes into a fixed latent and reconstructs them — agents and lanes stay temporally coherent.

Architecture, both halves

STAR-AE structure-aware temporal vector VAE
STAR-AE — slotify variable agents/lanes, then factorize time / space / cross-domain attention into one fixed latent.
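To make the "slotify" idea concrete, here is a minimal sketch of packing a variable number of agents (or lanes) into a fixed number of slots with a validity mask. It is an illustration under assumptions, not the STAR-AE implementation: the function names `slotify` and `masked_mean_pool`, the slot count, the truncation rule, and the `(N, T, D)` layout are all hypothetical, and the factorized time / space / cross-domain attention is left out entirely.

```python
import numpy as np

def slotify(elements: np.ndarray, num_slots: int):
    """Pack a variable number of elements into a fixed number of slots.

    elements: (N, T, D) array, N agents or lanes over T steps with D features.
    Returns (slots, valid): slots is (num_slots, T, D), valid is (num_slots,) bool.
    Hypothetical helper: the truncation rule below is an assumption.
    """
    n, t, d = elements.shape
    slots = np.zeros((num_slots, t, d), dtype=elements.dtype)
    valid = np.zeros(num_slots, dtype=bool)
    k = min(n, num_slots)      # drop extras if the scene has more elements than slots
    slots[:k] = elements[:k]
    valid[:k] = True
    return slots, valid

def masked_mean_pool(slots: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Average over valid slots only, so padding never leaks into the summary."""
    weights = valid[:, None, None].astype(slots.dtype)
    return (slots * weights).sum(axis=0) / max(int(valid.sum()), 1)

# Two scenes with different agent counts end up with the same tensor shape.
scene_a = np.ones((3, 4, 2))   # 3 agents, 4 time steps, 2 features
scene_b = np.ones((7, 4, 2))   # 7 agents, same horizon
slots_a, valid_a = slotify(scene_a, num_slots=5)
slots_b, valid_b = slotify(scene_b, num_slots=5)
pooled_a = masked_mean_pool(slots_a, valid_a)
```

The validity mask is what lets a downstream attention or pooling layer ignore padded slots, which is one common way to get a fixed-size representation from variable-size input.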
STRIDENet conditional latent diffusion architecture
STRIDENet — denoise in the standardized latent space, conditioned on history, with decode-domain physics regularization.
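As a rough illustration of what "denoise in the standardized latent space" can mean, the sketch below z-scores a batch of latents and applies the standard DDPM forward-noising formula, then inverts it with an oracle noise prediction. Everything here is assumed for illustration: the per-dimension statistics, the noise level `alpha_bar_t = 0.5`, and the helper names are hypothetical, and history conditioning and the decode-domain physics regularization are not modeled.

```python
import numpy as np

def standardize(z, mean, std, eps=1e-6):
    """Z-score latents so every dimension has roughly zero mean and unit variance."""
    return (z - mean) / (std + eps)

def destandardize(z_std, mean, std, eps=1e-6):
    """Map standardized latents back to the autoencoder's native latent scale."""
    return z_std * (std + eps) + mean

def q_sample(z0, alpha_bar_t, noise):
    """DDPM forward process: z_t = sqrt(a) * z0 + sqrt(1 - a) * noise."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * noise

def predict_z0(z_t, alpha_bar_t, eps_pred):
    """Recover the clean latent from z_t given a noise prediction."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
latents = rng.normal(3.0, 2.0, size=(256, 16))     # stand-in for encoder outputs
mean, std = latents.mean(axis=0), latents.std(axis=0)

z0 = standardize(latents, mean, std)
noise = rng.standard_normal(z0.shape)
z_t = q_sample(z0, 0.5, noise)

# An oracle that predicts the true noise recovers z0 exactly; a trained
# denoiser would replace `noise` here with its own estimate.
z0_rec = predict_z0(z_t, 0.5, noise)
restored = destandardize(z0_rec, mean, std)
```

Standardizing first keeps the latent scale matched to the unit-variance noise the diffusion process adds, which is the usual motivation for this step in latent-diffusion setups.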

Sensor-level closed loop

WorldSim closed-loop simulation framework
Three pipelines close the loop: Gaussian-Splatting reconstruction builds the real background, traffic generation populates it, and a DiT video world model renders the surround-view result.
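The control flow of such a loop can be sketched in a few lines. This is a toy skeleton under assumptions, not the WorldSim code: `run_closed_loop`, the callables passed to it, and in particular the `policy` step that feeds the rendered frame back into the next ego state are hypothetical stand-ins chosen to show what makes the loop closed.

```python
from typing import Any, Callable, List, Tuple

def run_closed_loop(
    background: Any,
    generate_traffic: Callable[[float], List[float]],
    render: Callable[[Any, List[float], float], dict],
    policy: Callable[[dict, float], float],
    ego: float,
    steps: int,
) -> Tuple[List[dict], float]:
    """Reconstruct once, then populate, re-render, and react at every step."""
    frames = []
    for _ in range(steps):
        agents = generate_traffic(ego)            # populate the scene
        frame = render(background, agents, ego)   # re-render what the cameras see
        frames.append(frame)
        ego = policy(frame, ego)                  # the reaction changes the next step
    return frames, ego

# Toy stand-ins: the ego is a 1-D position and one agent stays 10 m ahead.
frames, final_ego = run_closed_loop(
    background="reconstructed-scene",
    generate_traffic=lambda ego: [ego + 10.0],
    render=lambda bg, agents, ego: {"background": bg, "agents": agents, "ego": ego},
    policy=lambda frame, ego: ego + 1.0,
    ego=0.0,
    steps=3,
)
```

The background is built once and reused, while traffic and rendering depend on the evolving ego state; that dependency is the difference between a closed loop and replaying a fixed log.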

Mask-guided DiT — edit, don't regenerate

Four semantic masks partition every frame so the model only computes what must change.

Mask      Region                  Action
M_keep    Known background        Frozen: skip all compute
M_ctx     Reference background    Cached: provide K/V only
M_edge    Fg/bg boundary          Active: repair the seam
M_gen     Foreground              Active: generate by condition
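One way to picture a four-way partition like this is to derive all masks from a single foreground box. The sketch below is purely an assumption for illustration: the source does not say how the masks are built, so the box-dilation construction, the band widths `edge_px` and `ctx_px`, and the function names are hypothetical. In particular, M_ctx is modeled here as a ring of nearby background, which may differ from the actual reference region.

```python
import numpy as np

def dilate(mask: np.ndarray, r: int) -> np.ndarray:
    """Binary dilation with a (2r+1) x (2r+1) square, built from shifted ORs."""
    h, w = mask.shape
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def partition_masks(h: int, w: int, box, edge_px: int = 1, ctx_px: int = 2) -> dict:
    """Split an h x w grid into keep / ctx / edge / gen from one box (y0, x0, y1, x1)."""
    y0, x0, y1, x1 = box
    m_gen = np.zeros((h, w), dtype=bool)
    m_gen[y0:y1, x0:x1] = True                       # foreground to generate
    grown_edge = dilate(m_gen, edge_px)
    grown_ctx = dilate(m_gen, edge_px + ctx_px)
    m_edge = grown_edge & ~m_gen                     # seam around the foreground
    m_ctx = grown_ctx & ~grown_edge                  # assumed: nearby background as reference
    m_keep = ~grown_ctx                              # everything else stays frozen
    return {"keep": m_keep, "ctx": m_ctx, "edge": m_edge, "gen": m_gen}

masks = partition_masks(12, 12, box=(4, 4, 8, 8))
coverage = sum(m.astype(int) for m in masks.values())   # 1 everywhere if disjoint and complete
active_fraction = (masks["edge"] | masks["gen"]).mean()
```

Whatever the real construction is, the property worth checking is the one computed in `coverage`: the four masks should be disjoint and together cover every token exactly once.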
Mask-guided DiT architecture on MagicDrive-V2
Background is locked every step; foreground tokens align to BBox trajectories and read background appearance for seamless style fusion.
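The behaviour described above can be sketched as a single attention step in which only active tokens issue queries. This is a simplified illustration, not the editor's actual layer: the single-head formulation, the residual update, the string role labels, and the name `mask_guided_attention` are assumptions, and a real system would cache the context keys and values across denoising steps rather than recompute them as done here.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mask_guided_attention(tokens, role, wq, wk, wv):
    """Update edge/gen tokens only; ctx tokens are read as K/V; keep tokens are skipped.

    tokens: (N, D) array; role: (N,) array of "keep" | "ctx" | "edge" | "gen".
    """
    active = np.isin(role, ["edge", "gen"])     # tokens that get recomputed
    readable = active | (role == "ctx")         # tokens that may be attended to
    q = tokens[active] @ wq
    k = tokens[readable] @ wk                   # recomputed here; cacheable in practice
    v = tokens[readable] @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = tokens.copy()                         # keep and ctx tokens pass through unchanged
    out[active] = tokens[active] + attn @ v     # residual update on active tokens only
    return out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))
role = np.array(["keep", "keep", "ctx", "edge", "gen", "gen"])
wq, wk, wv = (rng.standard_normal((4, 4)) for _ in range(3))

out = mask_guided_attention(tokens, role, wq, wk, wv)

# Perturbing a frozen token does not affect the active outputs, because
# keep tokens are neither queried nor read.
perturbed = tokens.copy()
perturbed[0] += 5.0
out_perturbed = mask_guided_attention(perturbed, role, wq, wk, wv)
```

The compute saving comes from the query side: the attention matrix has one row per active token instead of one per token in the frame.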

Before vs after the mask-guided editor

Direct baseline. Applying the original MagicDrive-V2 method as-is: scene completion breaks down.
Mask-guided DiT. Foreground generated under different lighting, background untouched.

Closed-loop results

Precise control under a rainy-weather condition.
Pure-noise generation — full surround scene from scratch.
My role. Designed the structure-aware temporal vector world model (STAR-AE + STRIDENet) and built the sensor-level closed loop on a MagicDrive-V2-based mask-guided DiT editor. Wording is high-level to protect enterprise confidentiality.
Confidentiality note. Bosch (XC-CN) ongoing research. Architecture and method are presented at a portfolio level; internal data, calibration, metrics, and product details are intentionally omitted or sanitized.