Research Project

Integrated Perception, Planning, and Decision-Making Network

A unified multi-task network that fuses RGB, LiDAR, and infrared for closed-loop perception, planning, and decision-making in simulation.

Timeline: 2021.08–2022.10
Context: The Future Laboratory of the Second Aerospace Academy
Role: Perception and Simulation Developer
Stage: Pre-research

Overview

What is this project about?

A unified multi-task framework that fuses multi-modal sensors (RGB, LiDAR, infrared) through attention-based feature fusion — jointly solving geometric–semantic mapping, unsupervised depth and odometry, multi-object detection and tracking, and closed-loop behavior decisions inside one end-to-end trainable network.

Tags: research · e2e · perception

System Architecture

One model, four interdependent abilities

Heterogeneous sensor streams are encoded by modality-specific extractors, then fused by interactive cross- and self-attention into a single representation that drives perception, reconstruction, and decision heads — forming a closed-loop perception → planning → decision pipeline.

Logic map

One model, four abilities, one closed loop

Heterogeneous sensors fuse into a shared representation that drives perception, mapping, planning, and decisions; the resulting action then loops back into the simulation.

Unified network. Camera / LiDAR / infrared are independently encoded, dynamically fused via cross- and self-attention, then decoded into structure reconstruction, uncertainty-aware segmentation, and multi-target detection — while feeding a hierarchically pre-trained branch for state perception and behavior decisions.
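The cross-attention fusion described above can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections, where camera tokens query LiDAR tokens; the function names and toy features are hypothetical, and the project's actual fusion layers are not shown in this write-up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    Each query (e.g. a camera token) becomes a weighted average of the
    values (e.g. LiDAR tokens), weighted by query-key similarity.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        fused.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return fused


# Hypothetical toy features: 2 camera tokens attend over 3 LiDAR tokens.
camera_tokens = [[1.0, 0.0], [0.0, 1.0]]
lidar_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(camera_tokens, lidar_tokens, lidar_tokens)
```

Self-attention is the same operation with queries, keys, and values all drawn from one token set. A trainable version would add learned query / key / value projections and multiple heads.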

Module Breakdown

Four research modules

Each of the four modules below covers what it does, the pipeline behind it, and its result.

Module 01 · Webots + YOLOv8

2D detection and multi-object tracking

Render a virtual scene in Webots, detect objects per frame, then keep stable IDs and trajectories across time.

01 · Input

Webots simulation

Virtual scene → sensor model → RGB camera → frame stream

02 · Detection

YOLOv8 inference

Preprocess / normalize → inference → confidence filter + NMS → boxes + class + score

03 · Tracking

Hungarian + Kalman

IoU / feature matching → Kalman predict & update → ID assignment and track management

04 · Output

ID + trajectory + velocity

ID-tagged video stream with trajectories overlaid in the simulator
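The "confidence filter + NMS" step in the detection stage can be sketched as follows. This is a generic greedy, class-aware NMS in plain Python, not the project's code. The thresholds, the `(box, class_id, score)` tuple layout, and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(detections, conf_thresh=0.25, iou_thresh=0.5):
    """Drop low-score detections, then greedily suppress overlapping boxes.

    detections: list of (box, class_id, score) tuples.
    """
    candidates = sorted(
        (d for d in detections if d[2] >= conf_thresh),
        key=lambda d: d[2],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # A box survives unless it overlaps an already kept box of the same class.
        if all(det[1] != k[1] or iou(det[0], k[0]) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


# Hypothetical raw detections for one frame.
raw = [
    ((10, 10, 50, 50), 0, 0.90),
    ((12, 12, 52, 52), 0, 0.80),      # near-duplicate of the first box
    ((100, 100, 140, 140), 1, 0.60),
    ((200, 200, 220, 220), 0, 0.10),  # below the confidence threshold
]
kept = filter_and_nms(raw)
```

Processing boxes in descending score order means the highest-scoring box of each overlapping cluster is the one that survives.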

Simulation result

Detection and tracking during simulation exploration.
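For the "Kalman predict & update" step in the tracking stage, a minimal sketch is a constant-velocity filter on a single coordinate (for example, the x-center of a tracked box). The class name, the noise variances, and the simplified diagonal process noise are assumptions; a full tracker would filter the complete box state and pair this with Hungarian assignment.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state [position, velocity] and position-only measurements."""

    def __init__(self, position, pos_var=1.0, vel_var=10.0,
                 process_var=0.01, meas_var=1.0):
        self.p, self.v = position, 0.0
        # Symmetric covariance [[p00, p01], [p01, p11]] stored as three numbers.
        self.p00, self.p01, self.p11 = pos_var, 0.0, vel_var
        self.q, self.r = process_var, meas_var

    def predict(self, dt=1.0):
        """Propagate the state with F = [[1, dt], [0, 1]] and Q = q * I."""
        self.p += self.v * dt
        p00 = self.p00 + 2 * dt * self.p01 + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p11 = p00, p01, p11
        return self.p

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        residual = z - self.p
        s = self.p00 + self.r                    # innovation variance
        k0, k1 = self.p00 / s, self.p01 / s      # Kalman gain
        self.p += k0 * residual
        self.v += k1 * residual
        # P = (I - K H) P
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11
        return self.p


# Hypothetical target moving one unit per frame, measured without noise.
track = ConstantVelocityKalman1D(position=0.0)
for frame in range(1, 21):
    track.predict(dt=1.0)
    track.update(float(frame))
```

The velocity estimate is what supplies the "velocity" in the module's output, and the predicted position is what gets matched against the next frame's detections.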

Module 02 · 2D ResNet ‖ 3D Sparse CNN

Joint geometry–semantic estimation

Encode image and point cloud separately, fuse them with dual-channel attention, and co-train four auxiliary tasks for a geometry–semantic consistent representation.

Unsupervised depth · Self-supervised segmentation · Point-distance odometry · Metric-learning recognition

Input

Image (2D)

2D residual CNN encoder

Input

Point cloud (3D)

3D sparse CNN encoder

Fusion

Dual-channel attention

Align & merge features → joint geometry–semantic representation

Depth + Seg

Dense scene

Depth-aware semantics

Odometry + Map

Ego motion

Scaled pose → 3D map
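Unsupervised depth and odometry are commonly trained together by view synthesis: predicted depth and relative pose warp pixels from one frame into a neighboring frame, and the photometric difference serves as the training signal. The sketch below shows only the geometric warp for a single pixel under a pinhole camera model. The function name, the intrinsics tuple, and the sample values are hypothetical, and this write-up does not state which loss the project actually used.

```python
def warp_pixel(u, v, depth, intrinsics, rotation, translation):
    """Project a target-view pixel with known depth into the source view.

    intrinsics: (fx, fy, cx, cy) of a pinhole camera.
    rotation, translation: rigid transform from target to source camera frame.
    Returns the source-view pixel (u', v'), or None if the point is behind
    the source camera.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the target camera frame.
    point = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
    # Rigid transform into the source camera frame: X_s = R X_t + t.
    xs, ys, zs = (
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    if zs <= 0:
        return None
    # Perspective projection back to pixel coordinates.
    return (fx * xs / zs + cx, fy * ys / zs + cy)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (500.0, 500.0, 320.0, 240.0)  # hypothetical intrinsics

same = warp_pixel(320.0, 240.0, 2.0, K, IDENTITY, (0.0, 0.0, 0.0))
shifted = warp_pixel(320.0, 240.0, 2.0, K, IDENTITY, (0.1, 0.0, 0.0))
```

With an identity pose the pixel maps to itself. With a sideways translation the shift is inversely proportional to depth, which is why depth and ego motion can supervise each other.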

Architecture & demo

Multi-modal fusion network architecture: 2D / 3D dual-encoder fusion with four parallel auxiliary tasks.

Multi-task estimation demo: geometry–semantic estimation in motion.

Module 03 · ROS

Monocular visual SLAM with dense mapping

A real-time ROS pipeline: track the camera, optimize the graph, and reconstruct a dense map from a single camera.

01 · Front-end

Tracking

ORB features → init → keyframe selection → pose optimization (+ relocalization)

02 · Back-end

Graph optimization

Local bundle adjustment (BA) ↔ loop closure (DBoW) ↔ global pose-graph optimization

03 · Dense map

Reconstruction

Monocular depth → reprojection → TSDF / OctoMap fusion + voxel denoising

04 · Output

Three map products

/dense_pointcloud · /octomap_3d · /map_2d_grid
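The dense-map stage above (depth → reprojection → voxel fusion with denoising) can be sketched in simplified form. The code back-projects a depth image through a pinhole model and keeps only voxels observed at least `min_hits` times. This hit-count filter is a stand-in chosen for brevity; it is not a TSDF or OctoMap implementation, and all names and values are hypothetical.

```python
import math
from collections import Counter


def backproject(depth, fx, fy, cx, cy):
    """Turn a depth image (list of rows, metres) into camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # zero marks an invalid depth reading
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def voxel_filter(points, voxel_size=0.1, min_hits=2):
    """Quantize points to voxel indices and drop sparsely observed voxels."""
    counts = Counter(
        tuple(math.floor(c / voxel_size) for c in p) for p in points
    )
    return {voxel for voxel, hits in counts.items() if hits >= min_hits}


# Hypothetical 2x2 depth image and intrinsics.
depth_image = [
    [1.0, 1.0],
    [0.0, 4.0],
]
points = backproject(depth_image, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
occupied = voxel_filter(points, voxel_size=1.0, min_hits=2)
```

In a running pipeline the points would first be transformed by the keyframe pose from the SLAM back-end, so that hits from many frames accumulate in one world-frame grid.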

Real-time result

Real-time monocular SLAM with dense mapping.

Module 04 · Webots closed-loop

Perception → planning → decision

Perception feeds planning, planning feeds an FSM / RL decision agent, and the action loops back to the vehicle in simulation.

Perception layer

Understand the scene

Detection → obstacles · segmentation → drivable area · depth → distance · SLAM → map · odometry → ego pose

Planning layer

Find a path

Global A* / Dijkstra → local DWA / MPC → collision check & avoidance
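As an illustration of the global planning step, here is a minimal A* search on a 4-connected occupancy grid with a Manhattan-distance heuristic. The grid encoding (`#` for obstacles) and the function name are assumptions; the local DWA / MPC stage and collision checking are not covered.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of strings; '#' marks an obstacle.

    Returns a list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    heapq.heappush(
                        open_heap,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                    )
    return None


# Hypothetical map: a wall with a gap along the bottom row.
grid = [
    "..#.",
    "..#.",
    "....",
]
path = astar(grid, start=(0, 0), goal=(0, 3))
```

Setting the heuristic to zero turns the same loop into Dijkstra's algorithm, which is the other global planner named above.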

Decision layer

Choose an action

FSM / RL agent with reward shaping → accelerate / brake / steer
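The FSM variant of the decision layer can be sketched as a three-state machine driven by the distance to the nearest obstacle. The state names, distance thresholds, hysteresis margin, and action values are all hypothetical, and the RL agent with reward shaping is not shown.

```python
from enum import Enum


class State(Enum):
    CRUISE = "cruise"
    FOLLOW = "follow"
    BRAKE = "brake"


# Hypothetical throttle / brake commands per state.
ACTIONS = {
    State.CRUISE: {"throttle": 0.6, "brake": 0.0},
    State.FOLLOW: {"throttle": 0.2, "brake": 0.0},
    State.BRAKE: {"throttle": 0.0, "brake": 1.0},
}


def transition(state, obstacle_dist, brake_dist=5.0, follow_dist=15.0, margin=2.0):
    """Return the next state given the distance (metres) to the nearest obstacle."""
    if obstacle_dist < brake_dist:
        return State.BRAKE
    # Hysteresis: once braking, keep braking until the gap clearly reopens.
    if state is State.BRAKE and obstacle_dist < brake_dist + margin:
        return State.BRAKE
    if obstacle_dist < follow_dist:
        return State.FOLLOW
    return State.CRUISE


def steer_command(heading_error, gain=1.0):
    """Proportional steering toward the planned path, clamped to [-1, 1]."""
    return max(-1.0, min(1.0, gain * heading_error))


# Hypothetical obstacle distances over six control steps.
state = State.CRUISE
history = []
for dist in [30.0, 12.0, 4.0, 6.0, 8.0, 20.0]:
    state = transition(state, dist)
    history.append(state)
```

The hysteresis margin keeps the vehicle from oscillating between braking and following when the measured distance hovers around the braking threshold.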

Closed loop

Act & evaluate

Vehicle executes → metrics (success / collision rate, absolute trajectory error (ATE), map IoU) → iterate
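The evaluation metrics named above can be computed roughly as follows. This is a sketch under simplifying assumptions: the trajectory error is an RMSE over already-aligned, time-synchronized positions (a full ATE evaluation would first align the two trajectories), the map is a set of occupied cells, and the episode record format is hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position error between two aligned, equal-length trajectories."""
    squared = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))


def map_iou(predicted_cells, ground_truth_cells):
    """Intersection over union of two sets of occupied map cells."""
    pred, gt = set(predicted_cells), set(ground_truth_cells)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


def episode_rates(episodes):
    """Success and collision rates over a list of episode records."""
    n = len(episodes)
    return {
        "success_rate": sum(1 for e in episodes if e["success"]) / n,
        "collision_rate": sum(1 for e in episodes if e["collision"]) / n,
    }


# Hypothetical evaluation data.
error = ate_rmse([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 2.0)])
overlap = map_iou({(0, 0), (0, 1)}, {(0, 1), (1, 1)})
rates = episode_rates([
    {"success": True, "collision": False},
    {"success": True, "collision": False},
    {"success": True, "collision": False},
    {"success": False, "collision": True},
])
```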

Closed-loop roadmap

System roadmap: four modules forming a closed perception–planning–decision loop.

How they connect. Detection & tracking and geometry–semantic estimation supply targets and structure; dense SLAM builds the global map; planning & decision close the loop and feed evaluation back into every module.

Confidentiality note. Only high-level system modules are shown. Mission-specific and customer-specific details are omitted.