Built a Cosmos-Transfer2.5-based generative simulation platform: a 7V surround world model validated on internal data, real-map (Ingolstadt OSM → layout → 7V) scenario generation, a gRPC semantic bridge between WorldSim and the world model, the first 4-step distillation of 7V surround video (rCM + DMD2) for up to ~13.9× speedup, an editable platform for rare interaction data, and an all-in-one OneModel that serves layout generation, Gaussian-Splatting fix, and harmonization from a single denoiser.
Built a two-level controllable driving simulator: a structure-aware temporal vector VAE (STAR-AE) that compresses sparse, variable agents and lanes into fixed latents, a conditional latent-diffusion generator (STRIDENet) that produces history-consistent future traffic, and a sensor-level closed-loop WorldSim that fuses Gaussian-Splatting reconstruction, traffic-flow generation, and a mask-guided DiT video editor (built on MagicDrive-V2) into photorealistic surround rollouts.
A one-stage, pure-vision end-to-end driving POC that lifts 8 surround cameras into a single BEV feature, reads three structured perception heads (3D detection, HD map, occupancy) from it, predicts the next-frame BEV under generative scoring, and tokenizes everything into a Diffusion-Flow planner that emits the ego trajectory and neighbouring-agent states — perception, prediction, and planning optimised jointly.
End-to-end production perception on a mid-trim (J6E / J6M) platform, organised around three shipped systems: a multi-task static OneModel that drives every static element from one shared BEV feature, a 4D-sparse dynamic model that unifies detection and tracking, and an on-board latency-compression effort that cut inference from ~42.65 ms to ~13.88 ms (roughly a 3.1× speedup). My work spans architecture, a unified data pipeline, heterogeneous multi-task training, release engineering, and quantization-aware deployment.
An end-to-end autonomous-driving system that fuses 11 surround cameras (7 pinhole + 4 fisheye) with LiDAR under a sparse-centric (SparseDrive-style) paradigm. My two core deliverables: a BEV-fusion CUDA operator that aligns the 11-camera and LiDAR features in a single fused kernel, and the training of an AI planner that outputs motion prediction and planning in parallel from a shared query decoder.
Built a controllable surround-view driving generator that compresses 3D boxes and maps into spatial conditions, encodes text / reference frames / lanes / camera calibration into condition tokens, and injects them into a UNet diffusion backbone — producing cross-camera-consistent 4V / 7V / 11V images and video for data augmentation and open-loop simulation, evolving from OpenSora 1.0 + SD 3.5 to a MagicDrive-fused in-house model.
Hozon Auto × SJTU IRMV · PhiGent Robotics · Perception Team Leader · 3D Perception Algorithm Engineer
A two-phase journey in autonomous-driving auto-labeling: first a Tesla-AI-Day-inspired vision-only 4D auto-labeling pipeline with Hozon Auto and SJTU IRMV, then a multi-modal 4D auto-labeling and production pure-LiDAR 3D detection system at PhiGent Robotics — optimized at the data, model, and loss levels.
PhiGent Robotics · 3D Scene Flow Algorithm Engineer
A 3D motion-estimation stack for autonomous driving: an unsupervised auto-labeling system that assigns a 3D scene-flow vector to every LiDAR point and every occupancy cell, validated by lifting the accuracy of existing flow estimators, distilled into an ultra-light production head, and deployed end-to-end through ONNX, TensorRT (Orin) and the Horizon J6E toolchain.
A road-surface perception project for the road-preview ('magic-carpet') suspension feature: segment safety-critical small road elements — manhole covers and speed bumps — reliably under hard real-world conditions (tiny targets, water and oil stains, textureless surfaces), then compress and quantize the model to INT8 for efficient TDA4 edge inference, reaching an initial mass-production quality bar.
A short introduction film I made for my 2023 master's graduation — a compact tour of my research focus, the labs and mentors I worked with, and the perception and 3D/4D systems I built along the way.
The lawn-mowing robot has to drive itself off a transport vehicle, reach the lawn, mow, and return, so I built its safety-critical perception stack across four modules: ramp detection for self-loading and unloading, 3D grass-obstacle detection (geometry first, then camera–LiDAR fusion), an MCU-deployed 2D BEV safety detector, and a dual-attention LiDAR–vision fusion study.