Research Platform

Vector Traffic Generation & Sensor-Level Closed-Loop Simulation

Two halves of a controllable driving simulator: a structure-aware temporal vector world model that compresses and generates traffic as latents, and a sensor-level closed-loop pipeline that reconstructs, populates, and re-renders photorealistic surround video.

Timeline: 2025.05–Present
Context: Bosch (XC-CN)
Role: World Models Algorithm Engineer
Stage: Ongoing

Overview

What is this project about?

Built a controllable driving simulator with two levels of control. At the vector level, a structure-aware temporal vector VAE (STAR-AE) compresses sparse, variable-count agents and lanes into fixed-size latents, and a conditional latent-diffusion generator (STRIDENet) produces future traffic consistent with the observed history. At the sensor level, a closed-loop WorldSim fuses Gaussian-Splatting reconstruction, traffic-flow generation, and a mask-guided DiT video editor (built on MagicDrive-V2) into photorealistic surround rollouts.

research · world-model · generative · e2e

Structure-aware temporal VAE · Conditional latent diffusion · Gaussian-Splatting reconstruction · Mask-guided DiT editor · MagicDrive-V2 base

2 levels: Vector control · sensor control

Fixed z: Variable agents/lanes → one latent

4 masks: Keep · context · edge · generate

Closed loop: Reconstruct → populate → re-render

Logic map

Two control levels — vectors decide what happens, sensors decide what cameras see

In the diagram, the left branch generates traffic as latents and the right branch reconstructs the photoreal background; both feed the mask-guided video editor.

Temporal vector AE in motion

The VAE encodes sparse, variable scenes into a fixed latent and reconstructs them — agents and lanes stay temporally coherent.

Architecture, both halves

STAR-AE (structure-aware temporal vector VAE): slotify variable agents/lanes, then apply factorized time / space / cross-domain attention to compress them into one fixed latent.
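The slotify step can be pictured as padding a variable-size set of agents (or lanes) into a fixed number of slots, with a validity mask marking which slots are real. The sketch below is a minimal illustration under assumptions: the function name `slotify`, the slot count, the zero padding, and the truncation of overflow items are all hypothetical and not taken from STAR-AE.

```python
from typing import List, Tuple


def slotify(
    items: List[List[float]], num_slots: int, feat_dim: int
) -> Tuple[List[List[float]], List[bool]]:
    """Pad or truncate a variable-length list of feature vectors to num_slots.

    Returns (slots, mask). slots always has num_slots rows of feat_dim floats;
    mask[i] is True where slot i holds a real item and False where it is padding.
    """
    for row in items:
        if len(row) != feat_dim:
            raise ValueError("every item must have feat_dim features")
    kept = items[:num_slots]  # overflow items are dropped (assumption)
    pad = [[0.0] * feat_dim for _ in range(num_slots - len(kept))]
    slots = [list(row) for row in kept] + pad
    mask = [True] * len(kept) + [False] * len(pad)
    return slots, mask


# Two scenes with different agent counts map to the same fixed shape.
scene_a = [[1.0, 2.0], [3.0, 4.0]]  # 2 agents, (x, y) features
scene_b = [[0.5, 0.5]]              # 1 agent
slots_a, mask_a = slotify(scene_a, num_slots=4, feat_dim=2)
slots_b, mask_b = slotify(scene_b, num_slots=4, feat_dim=2)
```

With a fixed slot layout, downstream attention layers can use the mask to ignore padded slots, so the encoder sees one tensor shape regardless of scene size.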

STRIDENet (conditional latent diffusion): denoise in the standardized latent space, conditioned on history, with physics regularization applied in the decoded domain.
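One possible reading of "standardized latent space" and "decode-domain physics regularization" is sketched below. Both parts are assumptions made for illustration: per-dimension z-scoring of latents, and an acceleration-limit penalty computed on decoded positions. The names `standardize` and `accel_penalty`, the time step `dt`, and the limit `a_max` are hypothetical; the statistics and regularizers actually used by STRIDENet are not described on this page.

```python
import math
from typing import List, Tuple


def standardize(
    latents: List[List[float]],
) -> Tuple[List[List[float]], List[float], List[float]]:
    """Z-score each latent dimension across samples; returns (z, mean, std)."""
    n, d = len(latents), len(latents[0])
    mean = [sum(row[j] for row in latents) / n for j in range(d)]
    # A zero standard deviation falls back to 1.0 to avoid division by zero.
    std = [
        math.sqrt(sum((row[j] - mean[j]) ** 2 for row in latents) / n) or 1.0
        for j in range(d)
    ]
    z = [[(row[j] - mean[j]) / std[j] for j in range(d)] for row in latents]
    return z, mean, std


def accel_penalty(xs: List[float], dt: float, a_max: float) -> float:
    """Mean squared excess of |acceleration| over a_max along a 1-D decoded path."""
    accels = [
        (xs[t + 1] - 2.0 * xs[t] + xs[t - 1]) / dt**2
        for t in range(1, len(xs) - 1)
    ]
    if not accels:
        return 0.0
    excess = [max(0.0, abs(a) - a_max) for a in accels]
    return sum(e * e for e in excess) / len(excess)


z, mean, std = standardize([[0.0, 10.0], [2.0, 10.0]])
smooth_cost = accel_penalty([0.0, 1.0, 2.0, 3.0], dt=1.0, a_max=3.0)
jerky_cost = accel_penalty([0.0, 0.0, 10.0, 10.0], dt=1.0, a_max=3.0)
```

Placing the penalty on decoded trajectories rather than on latents keeps the constraint in physical units, which is the design choice the caption points at.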

Sensor-level closed loop

WorldSim closed-loop simulation framework

Three pipelines close the loop: Gaussian-Splatting reconstruction builds the real background, traffic generation populates it, and a DiT video world model renders the surround result.

Mask-guided DiT — edit, don't regenerate

Four semantic masks partition every frame so the model computes only what must change.

Mask	Region	Action
M_keep	Known background	Frozen — skip all compute
M_ctx	Reference background	Cached — provide K/V only
M_edge	Fg/bg boundary	Active — repair the seam
M_gen	Foreground	Active — generate by condition
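The table can be read as a per-token routing rule for attention. The sketch below is an illustration under assumptions: the label strings mirror the mask names in the table, `route_tokens` is a hypothetical helper, and letting active tokens also serve as keys/values is an assumption of this sketch rather than something stated above.

```python
from typing import Dict, List

ACTIVE = {"edge", "gen"}  # recomputed at every step
KV_ONLY = {"ctx"}         # cached, contribute keys/values only
FROZEN = {"keep"}         # skipped entirely


def route_tokens(labels: List[str]) -> Dict[str, List[int]]:
    """Split token indices into query, key/value, and skipped groups."""
    unknown = set(labels) - ACTIVE - KV_ONLY - FROZEN
    if unknown:
        raise ValueError(f"unknown mask labels: {sorted(unknown)}")
    return {
        "query": [i for i, m in enumerate(labels) if m in ACTIVE],
        "kv": [i for i, m in enumerate(labels) if m in ACTIVE or m in KV_ONLY],
        "skipped": [i for i, m in enumerate(labels) if m in FROZEN],
    }


labels = ["keep", "keep", "ctx", "edge", "gen", "gen", "keep", "ctx"]
routes = route_tokens(labels)
# Share of tokens that need attention queries in this toy frame.
active_fraction = len(routes["query"]) / len(labels)
```

In this toy frame only three of eight tokens issue queries, which shows where the compute saving of editing over regenerating would come from.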

Mask-guided DiT architecture on MagicDrive-V2

Background is locked every step; foreground tokens align to BBox trajectories and read background appearance for seamless style fusion.
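Locking the background at every step can be sketched as restoring known background values after each denoising update, so only foreground tokens evolve. This is a simplified illustration under assumptions: `edit_with_locked_background` and `toy_step` are hypothetical names, the denoiser is a stand-in that halves values, and a real editor would likely skip compute for frozen tokens rather than overwrite them afterwards.

```python
from typing import Callable, List

Latent = List[float]


def edit_with_locked_background(
    x: Latent,
    background: Latent,
    is_background: List[bool],
    denoise_step: Callable[[Latent, int], Latent],
    num_steps: int,
) -> Latent:
    """Run num_steps of denoising, restoring background tokens after each step."""
    if not (len(x) == len(background) == len(is_background)):
        raise ValueError("x, background and is_background must have equal length")
    for step in range(num_steps):
        x = denoise_step(x, step)
        # Overwrite background positions so only foreground tokens can change.
        x = [b if locked else v for v, b, locked in zip(x, background, is_background)]
    return x


def toy_step(x: Latent, step: int) -> Latent:
    """Stand-in for a real denoiser: halves every value (illustration only)."""
    return [0.5 * v for v in x]


result = edit_with_locked_background(
    x=[8.0, 8.0, 8.0],
    background=[5.0, 0.0, 0.0],
    is_background=[True, False, False],
    denoise_step=toy_step,
    num_steps=3,
)
```

The first token stays at its background value while the two foreground tokens follow the denoiser, which is the behavior the caption describes at a high level.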

Before vs after the mask-guided editor

Direct baseline. When the original paper's method is applied as-is, scene completion breaks down.

Mask-guided DiT. Foreground generated under different lighting, background untouched.

Closed-loop results

Precise control under a rainy-weather condition.

Pure-noise generation — full surround scene from scratch.

My role. Designed the structure-aware temporal vector world model (STAR-AE + STRIDENet) and built the sensor-level closed loop on a MagicDrive-V2-based mask-guided DiT editor. Wording is high-level to protect enterprise confidentiality.

Confidentiality note. Bosch (XC-CN) ongoing research. Architecture and method are presented at a portfolio level; internal data, calibration, metrics, and product details are intentionally omitted or sanitized.