Production + Research Project

4D Auto-Labeling & Pure LiDAR 3D Detection

Two eras of autonomous-driving auto-labeling — a Tesla-inspired vision-only 4D pipeline, then a multi-modal and pure-LiDAR 3D detection system.

Timeline: 2022.11–2024
Context: Hozon Auto × SJTU IRMV · PhiGent Robotics
Role: Perception Team Leader · 3D Perception Algorithm Engineer
Stage: Production & R&D

Overview

What is this project about?

A two-phase journey in autonomous-driving auto-labeling: first a Tesla-AI-Day-inspired vision-only 4D auto-labeling pipeline with Hozon Auto and SJTU IRMV, then a multi-modal 4D auto-labeling and production pure-LiDAR 3D detection system at PhiGent Robotics — optimized at the data, model, and loss levels.

production | research | 3d-4d | perception | deployment

Same Direction, Two Eras

From vision-only 4D labels to pure-LiDAR detection

Both phases share one goal — automatically producing high-quality 4D training labels for autonomous driving — but differ in era, team, and sensor stack. The timeline below moves from a Tesla-inspired vision-only pipeline to a multi-modal + pure-LiDAR production system.

Logic map

One goal, two eras — vision-only 4D, then multi-modal + pure LiDAR

Both lanes chase the same prize: high-quality 4D labels with no human in the loop.

01 · 2022.11 – 2023.04 · Hozon Auto × SJTU IRMV · Perception Team Leader

Video Offline 4D Auto-Labeling

A vision-only offline pipeline replicating the spirit of Tesla AI Day: from multi-camera video + IMU, estimate surround depth, lift it to a 360° pseudo-LiDAR, separate static from dynamic content, and fuse everything into a 4D scene.

Overall 4D auto-labeling pipeline
System pipeline. Multi-camera + IMU → parallel perception features (flow / pose / depth / semantics) → three branches (3D detection · static reconstruction · ground reconstruction) → a fused 4D scene with novel-view synthesis.
Step 1 · Depth

Surround-view depth → pseudo-LiDAR

  • Multi-frame cost volume + Conv-LSTM + BEV fusion → 360° pseudo-LiDAR; IMU/GPS pose refinement
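The lift from a per-camera depth map to pseudo-LiDAR points can be sketched as a pinhole back-projection followed by a camera-to-ego transform. This is a minimal sketch under stated assumptions, not the project's code: the intrinsics `fx, fy, cx, cy`, the yaw-only extrinsic, and the function name `backproject_depth` are illustrative.

```python
import math

def backproject_depth(depth, fx, fy, cx, cy, cam_yaw=0.0, cam_t=(0.0, 0.0, 0.0)):
    """Lift a depth map (list of rows, metres) to 3D points in the ego frame.

    Camera frame: x right, y down, z forward. Ego frame: x forward, y left, z up.
    cam_yaw rotates the camera about the ego z axis (assumed yaw-only extrinsic).
    """
    points = []
    c, s = math.cos(cam_yaw), math.sin(cam_yaw)
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0.0:  # skip invalid depth
                continue
            # Pinhole back-projection into the camera frame.
            xc = (u - cx) * d / fx
            yc = (v - cy) * d / fy
            zc = d
            # Axis swap to an ego-aligned frame: forward = z, left = -x, up = -y.
            xf, yl, zu = zc, -xc, -yc
            # Rotate by the camera yaw and translate into the ego frame.
            points.append((c * xf - s * yl + cam_t[0],
                           s * xf + c * yl + cam_t[1],
                           zu + cam_t[2]))
    return points
```

Running this once per camera with that camera's yaw and concatenating the results gives a surround point set; a real rig would use full 6-DoF extrinsics.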
Step 2 · Features

Perceptual features → motion state

  • Optical flow · semantic/instance seg · scene flow → classify each point dynamic / static
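One simple way to turn scene flow into a motion state is to subtract the flow explained by ego motion and threshold the residual. The sketch below assumes that approach; the 0.2 m-per-frame threshold and the name `classify_motion` are hypothetical, and the original pipeline also uses semantic and instance cues that are not modelled here.

```python
import math

def classify_motion(scene_flow, ego_flow, threshold=0.2):
    """Label each point 'dynamic' or 'static' from residual flow magnitude.

    scene_flow / ego_flow: per-point (dx, dy, dz) displacement between frames, metres.
    threshold: illustrative residual-motion cutoff in metres per frame (assumed).
    """
    labels = []
    for (fx, fy, fz), (ex, ey, ez) in zip(scene_flow, ego_flow):
        # Motion left over after removing what the ego vehicle's own motion explains.
        residual = math.sqrt((fx - ex) ** 2 + (fy - ey) ** 2 + (fz - ez) ** 2)
        labels.append("dynamic" if residual > threshold else "static")
    return labels
```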
Step 3A · Static

Scene reconstruction

  • Warp multi-frame → global map; NeRF implicit ground + novel views
Step 3B · Dynamic

3D box labeling

  • Per-point scene flow → align frames → associate & weighted-average 3D boxes
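Once per-frame boxes for one object have been aligned and associated, the weighted average can be sketched as below. The box layout `(x, y, z, l, w, h, yaw)` and the choice of weights (for example detection confidence) are assumptions; yaw is averaged on the unit circle so that headings near ±π do not cancel.

```python
import math

def fuse_boxes(boxes, weights):
    """Weighted average of aligned 3D boxes (x, y, z, l, w, h, yaw) for one object."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Centre and size: plain weighted mean per component.
    fused = [sum(w * b[i] for b, w in zip(boxes, weights)) / total for i in range(6)]
    # Heading: weighted circular mean via summed sine and cosine.
    sin_sum = sum(w * math.sin(b[6]) for b, w in zip(boxes, weights))
    cos_sum = sum(w * math.cos(b[6]) for b, w in zip(boxes, weights))
    fused.append(math.atan2(sin_sum, cos_sum))
    return tuple(fused)
```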
Step 4 · Fuse

4D auto-labels

  • HD-map + 3D tracks + segmentation + synthetic views → time × space × semantics × motion
Foundation: pseudo-LiDAR
Surround-view depth estimation network
Surround-view depth → 360° pseudo-LiDAR, the geometric base for 4D reconstruction.
Static element reconstruction pipeline
Static elements. Instance segmentation + implicit (NeRF-style) rendering, separated by depth-guided scene flow.
Dynamic 3D box labeling pipeline
Dynamic objects. Scene-flow motion classification → cross-frame association → refined 3D boxes.
NeRF-based simulation data synthesis
Simulation synthesis. Implicit MLP predicts height & semantics; reprojection + cross-entropy drives high-quality synthetic labels.
Deep-dive walkthrough. The design, logic, and breakdown of a Tesla-style visual 4D auto-labeling system.
02 · 2024 · Internal R&D · PhiGent Robotics · 3D Perception Algorithm Engineer

4D Auto-Labeling & Pure LiDAR 3D Detection

Moving to a multi-modal stack: harden a BEVFusion-based 4D auto-labeler on hard objects (VRUs, long trailers, far range), then ship a production pure-LiDAR 3D detector on solid-state LiDAR.

Workstream A · Data level

Adaptive VRU instance augmentation

VRUs (pedestrians, cyclists) are safety-critical but heavily under-represented. We build a VRU instance database and adaptively paste real instances into scenes to rebalance training.

VRU instance DB | Adaptive paste | Image + point-cloud joint align | Depth-legal placement
① Collect

Instance mining

  • Extract every VRU: RGB patch + LiDAR segment + spatial meta (position / depth / pose)
② Database

Layered index

  • Index by class / distance band / scene type → diverse coverage
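A layered index of this kind can be sketched as a dictionary keyed by (class, distance band, scene type). The band edges, the `InstanceDB` class, and its method names are hypothetical; the source does not state the real bands or storage layout.

```python
import random
from collections import defaultdict

# Hypothetical distance bands in metres; the real bands are not given in the source.
DISTANCE_BANDS = [(0, 30), (30, 60), (60, 120), (120, 210)]

def band_of(distance_m):
    """Return the index of the distance band containing distance_m, or None."""
    for i, (lo, hi) in enumerate(DISTANCE_BANDS):
        if lo <= distance_m < hi:
            return i
    return None

class InstanceDB:
    """Index VRU instances by (class, distance band, scene type)."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, cls, distance_m, scene, instance):
        band = band_of(distance_m)
        if band is not None:  # drop instances outside every band
            self._buckets[(cls, band, scene)].append(instance)

    def sample(self, cls, distance_m, scene, rng=random):
        """Draw one instance from the matching bucket, or None if it is empty."""
        bucket = self._buckets.get((cls, band_of(distance_m), scene), [])
        return rng.choice(bucket) if bucket else None
```

Sampling from the bucket that matches the target placement keeps pasted instances consistent in point density and apparent size with where they end up in the scene.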
③ Fuse

Adaptive paste

  • Scene depth → legal position; match scale / occlusion; paste into image + point cloud jointly
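The depth-legal check can be illustrated as follows: a candidate position is accepted only if existing scene content does not sit in front of the pasted instance over most of its image footprint. This is a sketch under assumptions; the 20% occlusion tolerance, the per-pixel depth layout, and both function names are illustrative rather than taken from the project.

```python
def is_depth_legal(scene_depth, box, instance_depth, max_occluded=0.2):
    """Check whether an instance can be pasted at an image box without being hidden.

    scene_depth: 2D list of per-pixel scene depth in metres (0 = unknown).
    box: (u0, v0, u1, v1) pixel box, end-exclusive.
    instance_depth: depth at which the instance would be placed, in metres.
    max_occluded: tolerated fraction of pixels where the scene is closer (assumed).
    """
    u0, v0, u1, v1 = box
    total = occluded = 0
    for v in range(v0, v1):
        for u in range(u0, u1):
            total += 1
            d = scene_depth[v][u]
            if 0.0 < d < instance_depth:  # existing scene content is in front
                occluded += 1
    return total > 0 and occluded / total <= max_occluded

def paste_scale(source_depth, target_depth):
    """Image scale factor when moving an instance to a new depth (pinhole model)."""
    return source_depth / target_depth
```

Under the pinhole model, an instance captured at 10 m and pasted at 20 m is drawn at half its original pixel size.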
④ Sync

Label synchronization

  • Auto-generate 2D BBox / 3D BBox / mask → complete supervision
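Deriving the 2D box from the pasted 3D box keeps the two labels consistent by construction. A minimal sketch, assuming a pinhole camera and box corners already expressed in the camera frame (the function name and parameters are illustrative):

```python
def project_to_bbox(corners_cam, fx, fy, cx, cy):
    """2D box (u_min, v_min, u_max, v_max) enclosing projected 3D box corners.

    corners_cam: eight (x, y, z) corners in the camera frame (z forward, metres).
    Returns None if any corner lies at or behind the camera plane.
    """
    us, vs = [], []
    for x, y, z in corners_cam:
        if z <= 0.0:
            return None
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    return min(us), min(vs), max(us), max(vs)
```

A production version would also clip the result to the image bounds.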
Copy-paste augmentation
Real-instance copy-paste in image and point cloud
Real VRU instances pasted with geometric consistency across both modalities.
VRU metric improvement comparison
Result. Measurable precision & recall gains on VRU classes after adaptive instance augmentation.
Builds on. Real-instance paste from Real-Aug and adaptive geometric alignment from PGT-Aug (NeurIPS 2024), extended to joint Camera + LiDAR placement with depth-legal constraints and dynamic sampling.

Workstream B · Model level

BEV range & resolution tuning

Clamp the perception range to ≤ 210 m and raise BEV resolution so that far-field features stay dense, which directly improves very-long-range 3D detection in both training and the released model.

Before

Range > 210 m

Low BEV resolution

Far-field features sparse → weak long-range recall

After

Range ≤ 210 m

High BEV resolution

Far-field features dense → stronger long-range 3D detection
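
The range/resolution trade-off comes down to grid arithmetic, sketched below. The ±40 m lateral extent and the 0.4 m and 0.2 m cell sizes are illustrative values, not the project's settings.

```python
def bev_grid_shape(x_range, y_range, cell_size):
    """Number of BEV cells along x and y for a given extent (metres) and cell size."""
    nx = round((x_range[1] - x_range[0]) / cell_size)
    ny = round((y_range[1] - y_range[0]) / cell_size)
    return nx, ny

def cells_across(object_length_m, cell_size):
    """How many BEV cells an object of the given length spans."""
    return object_length_m / cell_size

coarse = bev_grid_shape((0.0, 210.0), (-40.0, 40.0), 0.4)  # (525, 200)
fine = bev_grid_shape((0.0, 210.0), (-40.0, 40.0), 0.2)    # (1050, 400)
```

Halving the cell size doubles the number of cells an object spans in each direction but quadruples the total cell count, which is why bounding the range makes the higher resolution affordable.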

Workstream C · Loss level

Ultra-long trailer: geometry-alignment loss

Articulated trailers bend when turning, so a single box never fits and heading drifts. We add multi-box labeling for multi-section trailers and a loss that forces box boundaries to hug the LiDAR surface.

Multi-box labeling | Point-to-boundary loss | Heading correction
Problem

Articulated & occluded

  • Tractor + trailer 1 + trailer 2 … each section sits at a different angle → a single box distorts IoU; points are sparse
Labeling

Multi-box supervision

  • Hand-label multi-section trailers so the model learns to emit several boxes per vehicle
Loss

Point-to-boundary alignment

  • For each LiDAR point P: cast a ray from box center through P
  • Ray ∩ box boundary = P_l
  • Loss = dist(P, P_l)
  • Point outside → loss rotates/translates the box onto the cloud; aligned → loss → 0
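The ray construction above can be sketched in BEV as follows. This is a non-differentiable reference under assumptions: the box is `(cx, cy, length, width, yaw)`, the loss is averaged over points, and points at the exact centre are skipped because the ray direction is undefined there. A training version would express the same geometry in an autodiff framework.

```python
import math

def point_to_boundary_loss(points, box):
    """Mean distance between each BEV point P and P_l, where P_l is the point at
    which the ray from the box centre through P crosses the box boundary.

    points: iterable of (x, y); box: (cx, cy, length, width, yaw).
    """
    cx, cy, length, width, yaw = box
    c, s = math.cos(yaw), math.sin(yaw)
    half_l, half_w = length / 2.0, width / 2.0
    total, count = 0.0, 0
    for x, y in points:
        # Express P in the box frame (rotate by -yaw about the centre).
        dx, dy = x - cx, y - cy
        px = c * dx + s * dy
        py = -s * dx + c * dy
        norm = math.hypot(px, py)
        if norm < 1e-9:  # ray direction is undefined at the centre
            continue
        # Scale t such that t * P lies on the rectangle boundary.
        t = min(half_l / abs(px) if px else math.inf,
                half_w / abs(py) if py else math.inf)
        total += abs(1.0 - t) * norm  # dist(P, P_l) measured along the ray
        count += 1
    return total / count if count else 0.0
```

Points lying exactly on the boundary contribute zero, so a box whose heading matches the trailer surface scores lower than a rotated one.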
Result

Tight, correctly-oriented boxes

  • Boxes hug the trailer surface; heading error sharply reduced
Geometry-alignment loss
Point-to-boundary geometry loss construction
Top: heading too large → loss pulls the box. Bottom: aligned → loss ≈ 0.
Trailer heading prediction failures
Before. Boxes miss the trailer body with large heading errors.
Automatic 3D detection and labeling result
After. Stable, tightly-fitted automatic 3D detection and labeling.

Workstream D · Production · Seyond Falcon K1

Pure LiDAR 3D detection

A production 3D detector for solid-state LiDAR — an open pipeline redesigned around reflectivity features and dilated convolutions, tuned for the sparse far field.

Reflectivity feature | Dilation FPN | CenterHead · NMS-free | INT8 · Orin
01 · Preprocess

Point-cloud preparation

  • Range filter → RANSAC ground removal → reflectivity normalize → pillar voxelization
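Part of this preprocessing chain can be sketched in plain Python. The ranges, the 0 to 255 reflectivity scale, and the pillar size are assumed values, and RANSAC ground removal is deliberately left out of the sketch.

```python
def preprocess(points, x_range=(0.0, 210.0), y_range=(-40.0, 40.0), max_reflectivity=255.0):
    """Range-filter points and normalise reflectivity to [0, 1].

    points: iterable of (x, y, z, reflectivity). Ranges and the reflectivity scale
    are assumptions; ground removal is omitted from this sketch.
    """
    kept = []
    for x, y, z, r in points:
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            kept.append((x, y, z, min(max(r / max_reflectivity, 0.0), 1.0)))
    return kept

def pillar_index(x, y, x_min=0.0, y_min=-40.0, pillar_size=0.2):
    """Integer (ix, iy) pillar coordinate of a point (pillar_size is illustrative)."""
    return int((x - x_min) // pillar_size), int((y - y_min) // pillar_size)
```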
02 · Pillars

PillarFeatureNet

  • Feature [x,y,z,intensity,x_c,y_c,z_c,x_p,y_p]; reflectivity is the core cue; hash voxelization + INT8 MLP
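The nine-dimensional feature can be built as in the sketch below, where a Python dictionary stands in for the hash-based voxelization. Here `(x_c, y_c, z_c)` is read as the offset from the mean of the points in the pillar and `(x_p, y_p)` as the offset from the pillar's geometric centre, which is the usual PointPillars convention and an assumption about this system. Grid origin and pillar size are illustrative.

```python
from collections import defaultdict

def pillar_features(points, x_min=0.0, y_min=-40.0, pillar_size=0.5):
    """Group points into pillars and build the 9-dim per-point feature
    [x, y, z, intensity, x_c, y_c, z_c, x_p, y_p].
    """
    pillars = defaultdict(list)  # (ix, iy) -> points; the dict acts as the hash map
    for p in points:
        ix = int((p[0] - x_min) // pillar_size)
        iy = int((p[1] - y_min) // pillar_size)
        pillars[(ix, iy)].append(p)

    features = {}
    for (ix, iy), pts in pillars.items():
        n = len(pts)
        # Mean of the points in this pillar.
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        mz = sum(p[2] for p in pts) / n
        # Geometric centre of the pillar in the x-y plane.
        px = x_min + (ix + 0.5) * pillar_size
        py = y_min + (iy + 0.5) * pillar_size
        features[(ix, iy)] = [
            (x, y, z, r, x - mx, y - my, z - mz, x - px, y - py)
            for x, y, z, r in pts
        ]
    return features
```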
03 · Scatter

Sparse → dense BEV

  • CUDA scatter via pillar-coordinate hash → dense pseudo-image (~35% faster end-to-end)
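What the scatter step computes can be shown with a CPU reference. This sketch only illustrates the data movement; it is not the CUDA kernel and says nothing about the reported speed-up. The `[C][H][W]` layout and argument names are assumptions.

```python
def scatter_to_bev(pillar_coords, pillar_feats, grid_w, grid_h, channels):
    """Scatter per-pillar feature vectors into a dense [C][H][W] pseudo-image."""
    canvas = [[[0.0] * grid_w for _ in range(grid_h)] for _ in range(channels)]
    for (ix, iy), feat in zip(pillar_coords, pillar_feats):
        if 0 <= ix < grid_w and 0 <= iy < grid_h:  # ignore out-of-grid pillars
            for c in range(channels):
                canvas[c][iy][ix] = feat[c]
    return canvas
```

Empty pillars stay at zero, so the backbone sees a regular image-like tensor regardless of how sparse the point cloud is.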
04 · Backbone

SECOND + Dilation FPN

  • Dilated deconv d = 1,2,4 enlarges receptive field for sparse far-field context
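The receptive-field gain from dilation follows from the effective kernel extent k + (k − 1)(d − 1). The sketch assumes 3×3 kernels, which the source does not state.

```python
def effective_kernel(kernel_size, dilation):
    """Effective spatial extent of a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# Extent in BEV cells for an assumed 3x3 kernel at each dilation rate.
extents = {d: effective_kernel(3, d) for d in (1, 2, 4)}
```

For a 3×3 kernel, dilations 1, 2, and 4 cover 3, 5, and 9 cells respectively with the same number of weights, which helps gather context where far-field points are sparse.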
05 · Head

CenterHead

  • Center heatmap (GaussianFocalLoss) + regression (offset / z / size / rotation / velocity); NMS-free peak extraction
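NMS-free peak extraction on a center heatmap is commonly done by keeping cells that are the maximum of their 3×3 neighbourhood, equivalent to a max-pool comparison. The sketch below assumes that approach; the 0.3 score threshold is illustrative, and tied neighbouring cells would both be kept.

```python
def extract_peaks(heatmap, threshold=0.3):
    """Return (score, row, col) for cells that are the maximum of their 3x3
    neighbourhood and reach the score threshold, sorted by descending score."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbours = [heatmap[rr][cc]
                          for rr in range(max(r - 1, 0), min(r + 2, h))
                          for cc in range(max(c - 1, 0), min(c + 2, w))]
            if score >= max(neighbours):  # local maximum, so no NMS pass is needed
                peaks.append((score, r, c))
    return sorted(peaks, reverse=True)
```

Each surviving peak indexes into the regression maps to read offset, height, size, rotation, and velocity for that object.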
Algorithm framework
Pure LiDAR 3D detection framework
Solid-state-LiDAR detector: pillars → scatter → SECOND + Dilation FPN → CenterHead.

Ablation · mAP by class

What each idea contributes

Method | mAP | Car | Bicycle | Pedestrian | Cone | Truck | Bus | Tricycle
Pillar features (baseline) | 0.667 | 0.959 | 0.835 | 0.710 | 0.722 | 0.628 | 0.863 | 0.615
+ Reflectivity | 0.704 | 0.966 | 0.861 | 0.780 | 0.787 | 0.653 | 0.873 | 0.715
+ Dilation FPN | 0.719 | 0.964 | 0.865 | 0.793 | 0.819 | 0.701 | 0.887 | 0.722
Voxel sparse features | 0.714 | 0.954 | 0.874 | 0.838 | 0.844 | 0.586 | 0.839 | 0.775
Reflectivity + dilated conv (final) | 0.763 | 0.968 | 0.909 | 0.885 | 0.907 | 0.780 | 0.892 | 0.760
Confidentiality note. Only public-level pipeline structure is shown. Internal datasets, labeling rules, exact metrics, model parameters, and customer-specific details are omitted; thresholds are illustrative.