Research Project

Integrated Perception, Planning, and Decision-Making Network

A unified multi-task network that fuses RGB, LiDAR, and infrared for closed-loop perception, planning, and decision-making in simulation.

Timeline: 2021.08–2022.10
Context: The Future Laboratory of the Second Aerospace Academy
Role: Perception and Simulation Developer
Stage: Pre-research

Overview

What is this project about?

A unified multi-task framework that fuses multi-modal sensors (RGB, LiDAR, infrared) through attention-based feature fusion — jointly solving geometric–semantic mapping, unsupervised depth and odometry, multi-object detection and tracking, and closed-loop behavior decisions inside one end-to-end trainable network.

Tags: research · e2e · perception

System Architecture

One model, four interdependent abilities

Heterogeneous sensor streams are encoded by modality-specific extractors, then fused by interactive cross- and self-attention into a single representation that drives perception, reconstruction, and decision heads — forming a closed-loop perception → planning → decision pipeline.

Logic map

One model, four abilities, one closed loop

Heterogeneous sensors fuse into a shared representation that drives perception, mapping, planning, and decisions; the action then loops back.

Unified network. Camera / LiDAR / infrared are independently encoded, dynamically fused via cross- and self-attention, then decoded into structure reconstruction, uncertainty-aware segmentation, and multi-target detection — while feeding a hierarchically pre-trained branch for state perception and behavior decisions.
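
As a rough illustration of the fusion step described above, the sketch below applies scaled dot-product cross-attention from camera tokens to LiDAR tokens, then self-attention over the result. It is a minimal pure-Python sketch, not the project's network: the names `attention` and `fuse`, the residual addition, and the absence of learned projection matrices are all assumptions made for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no learned weights)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def fuse(camera_tokens, lidar_tokens):
    """Hypothetical fusion: cross-attend camera -> LiDAR, add residually, then self-attend."""
    cross = attention(camera_tokens, lidar_tokens, lidar_tokens)
    mixed = [[c + x for c, x in zip(cam, cr)] for cam, cr in zip(camera_tokens, cross)]
    return attention(mixed, mixed, mixed)

# Toy tokens: 2 camera tokens and 3 LiDAR tokens, feature dimension 2.
camera = [[1.0, 0.0], [0.0, 1.0]]
lidar = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = fuse(camera, lidar)
```

The output keeps one fused vector per camera token. An infrared stream could be attended to in the same way; a trained network would also insert learned query, key, and value projections before each attention call.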

Module Breakdown

Four research modules

Each of the four modules below is described by its purpose, its pipeline, and its simulation result.

Module 01 · Webots + YOLOv8

2D detection and multi-object tracking

Render a virtual scene in Webots, detect objects per frame, then keep stable IDs and trajectories across time.

01 · Input

Webots simulation

  • Virtual scene → sensor model → RGB camera → frame stream
02 · Detection

YOLOv8 inference

  • Preprocess / normalize → inference → confidence filter + NMS → boxes + class + score
03 · Tracking

Hungarian + Kalman

  • IoU / feature matching → Kalman predict & update → ID assignment and track management
04 · Output

ID + trajectory + velocity

  • ID-tagged video stream with trajectories overlaid in the simulator
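
The "confidence filter + NMS" stage of the detection step can be sketched as follows. This is a generic, class-aware greedy non-maximum suppression written for illustration; the detection tuple layout and the threshold values are assumptions, not settings taken from the project or from YOLOv8.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_and_nms(detections, conf_thresh=0.25, iou_thresh=0.5):
    """detections: list of (box, class_id, score) tuples (assumed layout).

    Drops low-confidence detections, then greedily keeps the highest-scoring
    box and suppresses same-class boxes that overlap it too much.
    """
    candidates = sorted(
        (d for d in detections if d[2] >= conf_thresh),
        key=lambda d: d[2],
        reverse=True,
    )
    kept = []
    for det in candidates:
        if all(det[1] != k[1] or iou(det[0], k[0]) <= iou_thresh for k in kept):
            kept.append(det)
    return kept

detections = [
    ((0, 0, 10, 10), 0, 0.9),
    ((1, 1, 11, 11), 0, 0.8),    # overlaps the first box, same class
    ((0, 0, 10, 10), 1, 0.7),    # same location, different class
    ((50, 50, 60, 60), 0, 0.1),  # below the confidence threshold
]
kept = filter_and_nms(detections)
```

In this toy input the second box is suppressed by the first and the fourth is filtered by confidence, leaving two detections.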
Simulation result
Detection and tracking during simulation exploration.
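
The tracking step (IoU matching, then Kalman predict and update) can be illustrated with the sketch below. Two simplifications are assumed here: the optimal assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm (same objective, only practical for a handful of boxes), and the Kalman filter is a one-dimensional constant-velocity model for a single box coordinate. Class names, noise values, and the `min_iou` gate are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def assign(tracks, detections, min_iou=0.3):
    """Return (track_idx, det_idx) pairs that maximize total IoU."""
    swap = len(tracks) > len(detections)
    small, large = (detections, tracks) if swap else (tracks, detections)
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(perm))
    pairs = [(j, i) if swap else (i, j) for i, j in best_pairs]
    # Gate weak matches so unmatched tracks and detections stay unmatched.
    return [(t, d) for t, d in pairs if iou(tracks[t], detections[d]) >= min_iou]

class ConstantVelocityKalman1D:
    """Kalman filter with state [position, velocity] and a position measurement."""

    def __init__(self, position, q=1e-2, r=1.0):
        self.p, self.v = position, 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = q, r              # process / measurement noise (assumed)

    def predict(self, dt=1.0):
        self.p += self.v * dt
        (a, b), (c, d) = self.P
        # P = F P F^T + q I, with F = [[1, dt], [0, 1]]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        s = self.P[0][0] + self.r                    # innovation covariance
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s  # Kalman gain
        y = z - self.p                               # innovation
        self.p += k0 * y
        self.v += k1 * y
        (a, b), (c, d) = self.P
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (100, 100, 110, 110)]
matches = assign(tracks, detections)
```

Matched detections would feed `update` on the corresponding track's filters, unmatched detections would start new IDs, and tracks left unmatched for several frames would be retired.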

Module 02 · 2D ResNet ‖ 3D Sparse CNN

Joint geometry–semantic estimation

Encode image and point cloud separately, fuse them with dual-channel attention, and co-train four auxiliary tasks for a geometry–semantic consistent representation.

Auxiliary tasks: unsupervised depth · self-supervised segmentation · point-distance odometry · metric-learning recognition
Input

Image (2D)

  • 2D residual CNN encoder
Input

Point cloud (3D)

  • 3D sparse CNN encoder
Fusion

Dual-channel attention

  • Align & merge features → joint geometry–semantic representation
Depth + Seg

Dense scene

  • Depth-aware semantics
Odometry + Map

Ego motion

  • Scaled pose → 3D map
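
To make the idea behind point-distance odometry concrete, the sketch below recovers ego motion from matched points by minimizing their summed squared distances. It is a classical closed-form 2D rigid alignment used purely as a stand-in: the project learns odometry inside the network, and the function name `align_2d` and the assumption of known correspondences are illustrative only.

```python
import math

def align_2d(src, dst):
    """Least-squares rotation and translation mapping src points onto dst.

    src, dst: equal-length lists of (x, y) with known correspondences.
    Returns (theta, (tx, ty)) such that dst ~= R(theta) @ src + t.
    """
    n = len(src)
    cxs = sum(p[0] for p in src) / n
    cys = sum(p[1] for p in src) / n
    cxd = sum(p[0] for p in dst) / n
    cyd = sum(p[1] for p in dst) / n
    s_dot = s_cross = 0.0
    for (xs, ys), (xd, yd) in zip(src, dst):
        ax, ay = xs - cxs, ys - cys   # centered source point
        bx, by = xd - cxd, yd - cyd   # centered target point
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = cxd - (c * cxs - s * cys)
    ty = cyd - (s * cxs + c * cys)
    return theta, (tx, ty)

# Synthetic check: rotate by 0.3 rad and translate by (2, -1).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
c, s = math.cos(0.3), math.sin(0.3)
dst = [(c * x - s * y + 2.0, s * x + c * y - 1.0) for x, y in src]
theta, (tx, ty) = align_2d(src, dst)
```

Chaining such frame-to-frame transforms gives a trajectory, which is what allows per-frame geometry to be accumulated into a map.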
Architecture & demo
2D / 3D dual-encoder fusion with four parallel auxiliary tasks.
Geometry–semantic estimation in motion.

Module 03 · ROS

Monocular visual SLAM with dense mapping

A real-time ROS pipeline: track the camera, optimize the graph, and reconstruct a dense map from a single camera.

01 · Front-end

Tracking

  • ORB features → initialization → keyframe selection → pose optimization (+ relocalization)
02 · Back-end

Graph optimization

  • Local BA ↔ loop closure (DBoW) ↔ global pose-graph optimization
03 · Dense map

Reconstruction

  • Monocular depth → reprojection → TSDF / OctoMap fusion + voxel denoising
04 · Output

Three map products

  • /dense_pointcloud · /octomap_3d · /map_2d_grid
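
The "monocular depth → reprojection" step of the dense-map stage can be sketched with a pinhole camera model. The code below back-projects a depth image into 3D points, optionally transforms them to the world frame, and applies a simple count-based voxel filter as a stand-in for voxel denoising. It does not implement TSDF or OctoMap fusion; the intrinsics `fx, fy, cx, cy`, the pose format, and `min_points` are assumptions.

```python
import math

def backproject(depth, fx, fy, cx, cy, pose=None):
    """Back-project a depth image (list of rows, metric depth) to 3D points.

    pose: optional (R, t) camera-to-world transform, with R a 3x3 nested list
    and t a length-3 sequence. Pixels with non-positive depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            if pose is not None:
                R, t = pose
                p = tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i]
                          for i in range(3))
            points.append(p)
    return points

def voxelize(points, voxel_size, min_points=1):
    """Return occupied voxel indices, dropping voxels with too few hits."""
    counts = {}
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        counts[key] = counts.get(key, 0) + 1
    return {k for k, c in counts.items() if c >= min_points}

depth = [[1.0, 0.0],
         [2.0, 2.0]]
points = backproject(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Projecting the occupied voxels within a height band onto the ground plane is one common way to derive a 2D grid from the same data.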
Real-time result
Real-time monocular SLAM with dense mapping.

Module 04 · Webots closed-loop

Perception → planning → decision

Perception feeds planning, planning feeds an FSM / RL decision agent, and the action loops back to the vehicle in simulation.

Perception layer

Understand the scene

  • Detection → obstacles · segmentation → drivable area · depth → distance · SLAM → map · odometry → ego pose
Planning layer

Find a path

  • Global A* / Dijkstra → local DWA / MPC → collision check & avoidance
Decision layer

Choose an action

  • FSM / RL agent with reward shaping → accelerate / brake / steer
Closed loop

Act & evaluate

  • Vehicle executes → metrics (success rate, collision rate, absolute trajectory error (ATE), map IoU) → iterate
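
The global planning stage names A*; a minimal grid version is sketched below. It assumes a 4-connected occupancy grid with unit step cost and a Manhattan-distance heuristic, which are illustrative choices rather than the project's planner configuration.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected grid. grid[r][c] == 1 marks an obstacle.

    start, goal: (row, col) tuples. Returns the path as a list of cells,
    or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    g_cost = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > g_cost[cell]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]]:
                continue
            new_g = g + 1
            if new_g < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = new_g
                came_from[nxt] = cell
                heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))
```

A local planner such as DWA or MPC would then track this coarse path while handling vehicle dynamics and newly observed obstacles.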
Closed-loop roadmap
System roadmap: four modules forming a closed perception–planning–decision loop.
How they connect. Detection & tracking and geometry–semantic estimation supply targets and structure; dense SLAM builds the global map; planning & decision close the loop and feed evaluation back into every module.
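
As a small illustration of the FSM option in the decision layer, the sketch below maps the nearest-obstacle distance to one of the three actions named above (accelerate, brake, steer). The states, distance thresholds, and hysteresis rule are hypothetical and chosen only to show how perception output can drive a discrete decision; an RL agent would replace `next_state` with a learned policy.

```python
from enum import Enum

class State(Enum):
    CRUISE = "cruise"
    AVOID = "avoid"
    BRAKE = "brake"

# One action per state, using the action vocabulary from the decision layer.
ACTION = {State.CRUISE: "accelerate", State.AVOID: "steer", State.BRAKE: "brake"}

def next_state(state, obstacle_distance, brake_dist=2.0, caution_dist=8.0):
    """Transition rule driven by the nearest-obstacle distance in metres."""
    if obstacle_distance < brake_dist:
        return State.BRAKE
    if state is State.BRAKE and obstacle_distance < caution_dist:
        return State.BRAKE  # hysteresis: stay stopped until the way is clear
    if obstacle_distance < caution_dist:
        return State.AVOID
    return State.CRUISE

def run(distances, state=State.CRUISE):
    """Feed a sequence of distances through the FSM and collect the actions."""
    actions = []
    for d in distances:
        state = next_state(state, d)
        actions.append(ACTION[state])
    return actions

actions = run([20.0, 6.0, 1.5, 5.0, 9.0])
```

The fourth step shows why the current state matters: at 5 m the vehicle keeps braking because it was already stopped, whereas from CRUISE the same distance would trigger steering.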
Confidentiality note. Only high-level system modules are shown. Mission-specific and customer-specific details are omitted.