Robotics 42
☆ Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation
Generalist robot policies trained on large-scale datasets such as Open
X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks.
However, they often struggle to generalize beyond the distribution of their
training data. In this paper, we investigate the underlying cause of this
limited generalization capability. We identify shortcut learning -- the
reliance on task-irrelevant features -- as a key impediment to generalization.
Through comprehensive theoretical and empirical analysis, we uncover two
primary contributors to shortcut learning: (1) limited diversity within
individual sub-datasets, and (2) significant distributional disparities across
sub-datasets, leading to dataset fragmentation. These issues arise from the
inherent structure of large-scale datasets like OXE, which are typically
composed of multiple sub-datasets collected independently across varied
environments and embodiments. Our findings provide critical insights into
dataset collection strategies that can reduce shortcut learning and enhance the
generalization ability of generalist robot policies. Moreover, in scenarios
where acquiring new large-scale data is impractical, we demonstrate that
carefully selected robotic data augmentation strategies can effectively reduce
shortcut learning in existing offline datasets, thereby improving
generalization capabilities of generalist robot policies, e.g., $\pi_0$, in
both simulation and real-world environments. More information at
https://lucky-light-sun.github.io/proj/shortcut-learning-in-grps/.
comment: CoRL 2025
☆ V*: An Efficient Motion Planning Algorithm for Autonomous Vehicles
Autonomous vehicle navigation in structured environments requires planners
capable of generating time-optimal, collision-free trajectories that satisfy
dynamic and kinematic constraints. We introduce V*, a graph-based motion
planner that represents speed and direction as explicit state variables within
a discretised space-time-velocity lattice. Unlike traditional methods that
decouple spatial search from dynamic feasibility or rely on post-hoc smoothing,
V* integrates both motion dimensions directly into graph construction through
dynamic graph generation during search expansion. To manage the complexity of
high-dimensional search, we employ a hexagonal discretisation strategy and
provide formal mathematical proofs establishing optimal waypoint spacing and
minimal node redundancy under constrained heading transitions for
velocity-aware motion planning. We develop a mathematical formulation for
transient steering dynamics in the kinematic bicycle model, modelling steering
angle convergence with exponential behaviour, and deriving the relationship for
convergence rate parameters. This theoretical foundation, combined with
geometric pruning strategies that eliminate expansions leading to infeasible
steering configurations, enables V* to evaluate dynamically admissible
manoeuvres, ensuring each trajectory is physically realisable without further
refinement. We further demonstrate V*'s performance in simulation studies with
cluttered and dynamic environments involving moving obstacles, showing its
ability to avoid conflicts, yield proactively, and generate safe, efficient
trajectories with temporal reasoning capabilities for waiting behaviours and
dynamic coordination.
☆ L2Calib: $SE(3)$-Manifold Reinforcement Learning for Robust Extrinsic Calibration with Degenerate Motion Resilience IROS2025
Extrinsic calibration is essential for multi-sensor fusion, yet existing
methods rely on structured targets or fully-excited data, limiting real-world
applicability. Online calibration further suffers from weak excitation,
leading to unreliable estimates. To address these limitations, we propose a
reinforcement learning (RL)-based extrinsic calibration framework that
formulates extrinsic calibration as a decision-making problem and directly
optimizes $SE(3)$ extrinsics to enhance odometry accuracy. Our approach
leverages a probabilistic Bingham distribution to model 3D rotations, ensuring
stable optimization while inherently retaining quaternion symmetry. A
trajectory alignment reward mechanism enables robust calibration without
structured targets by quantitatively evaluating the estimated tightly-coupled
trajectory against a reference trajectory. Additionally, an automated data
selection module filters uninformative samples, significantly improving
efficiency and scalability for large-scale datasets. Extensive experiments on
UAVs, UGVs, and handheld platforms demonstrate that our method outperforms
traditional optimization-based approaches, achieving high-precision calibration
even under weak excitation conditions. Our framework simplifies deployment on
diverse robotic platforms by eliminating the need for high-quality initial
extrinsics and enabling calibration from routine operating data. The code is
available at https://github.com/APRIL-ZJU/learn-to-calibrate.
comment: IROS2025
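As an illustration of the trajectory-alignment idea, the sketch below scores a
candidate $SE(3)$ extrinsic by the RMSE between the transformed estimated
trajectory and a reference. This is a simplification: the tightly-coupled
estimator and the reward shaping are the paper's, while the function name, its
arguments, and the direct application of the extrinsic to trajectory points
are assumptions made here for illustration.

    import numpy as np

    def alignment_reward(T_ext, est_positions, ref_positions):
        """Negative RMSE between extrinsic-corrected estimate and reference.

        T_ext: 4x4 candidate extrinsic; *_positions: (N, 3) trajectories.
        """
        hom = np.hstack([est_positions, np.ones((len(est_positions), 1))])
        aligned = (T_ext @ hom.T).T[:, :3]
        rmse = np.sqrt(np.mean(np.sum((aligned - ref_positions) ** 2, axis=1)))
        return -rmse  # higher reward = better-aligned trajectory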
☆ Towards Balanced Behavior Cloning from Imbalanced Datasets
Robots should be able to learn complex behaviors from human demonstrations.
In practice, these human-provided datasets are inevitably imbalanced: i.e., the
human demonstrates some subtasks more frequently than others. State-of-the-art
methods default to treating each element of the human's dataset as equally
important. So if -- for instance -- the majority of the human's data focuses on
reaching a goal, and only a few state-action pairs move to avoid an obstacle,
the learning algorithm will place greater emphasis on goal reaching. More
generally, misalignment between the relative amounts of data and the importance
of that data causes fundamental problems for imitation learning approaches. In
this paper we analyze and develop learning methods that automatically account
for mixed datasets. We formally prove that imbalanced data leads to imbalanced
policies when each state-action pair is weighted equally; these policies
emulate the most represented behaviors, and not the human's complex, multi-task
demonstrations. We next explore algorithms that rebalance offline datasets
(i.e., reweight the importance of different state-action pairs) without human
oversight. Reweighting the dataset can enhance the overall policy performance.
However, there is no free lunch: each method for autonomously rebalancing
brings its own pros and cons. We formulate these advantages and disadvantages,
helping other researchers identify when each type of approach is most
appropriate. We conclude by introducing a novel meta-gradient rebalancing
algorithm that addresses the primary limitations behind existing approaches.
Our experiments show that dataset rebalancing leads to better downstream
learning, improving the performance of general imitation learning algorithms
without requiring additional data collection. See our project website:
https://collab.me.vt.edu/data_curation/.
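To make the reweighting idea concrete, here is a minimal sketch, not the
paper's meta-gradient algorithm: each state-action pair receives a weight
(here a hypothetical inverse-subtask-frequency heuristic) inside a weighted
behavior cloning loss. The policy.log_prob interface and the subtask labels
are illustrative assumptions.

    import torch

    def inverse_frequency_weights(subtask_ids):
        """One rebalancing heuristic: weight samples by 1 / subtask frequency."""
        ids, counts = torch.unique(subtask_ids, return_counts=True)
        freq = counts.float() / subtask_ids.numel()
        lookup = {i.item(): 1.0 / f.item() for i, f in zip(ids, freq)}
        return torch.tensor([lookup[i.item()] for i in subtask_ids])

    def weighted_bc_loss(policy, states, actions, weights):
        """Weighted negative log-likelihood over an imbalanced dataset."""
        log_probs = policy.log_prob(states, actions)  # assumed policy API
        return -(weights * log_probs).sum() / weights.sum()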
☆ Surrogate-Enhanced Modeling and Adaptive Modular Control of All-Electric Heavy-Duty Robotic Manipulators
This paper presents a unified system-level modeling and control framework for
an all-electric heavy-duty robotic manipulator (HDRM) driven by
electromechanical linear actuators (EMLAs). A surrogate-enhanced actuator
model, combining integrated electromechanical dynamics with a neural network
trained on a dedicated testbed, is integrated into an extended virtual
decomposition control (VDC) architecture augmented by a natural adaptation law.
The derived analytical HDRM model supports a hierarchical control structure
that seamlessly maps high-level force and velocity objectives to real-time
actuator commands, accompanied by a Lyapunov-based stability proof. In
multi-domain simulations of both a cubic and a custom planar triangular
trajectory, the proposed adaptive modular controller achieves sub-centimeter
Cartesian tracking accuracy. Experimental validation on the same 1-DoF platform
under realistic load emulation confirms the efficacy of the proposed control
strategy. These findings demonstrate that a surrogate-enhanced EMLA model
embedded in the VDC approach can enable modular, real-time control of an
all-electric HDRM, supporting its deployment in next-generation mobile working
machines.
comment: This is submitted to IEEE T-ASE
☆ Evaluating Robot Program Performance with Power Consumption Driven Metrics in Lightweight Industrial Robots
The code performance of industrial robots is typically analyzed through CPU
metrics, which overlook the physical impact of code on robot behavior. This
study introduces a novel framework for assessing robot program performance from
an embodiment perspective by analyzing the robot's electrical power profile.
Our approach diverges from conventional CPU-based evaluations and instead
leverages a suite of normalized metrics, namely, the energy utilization
coefficient, the energy conversion metric, and the reliability coefficient, to
capture how efficiently and reliably energy is used during task execution.
Complementing these metrics, the established robot wear metric provides further
insight into long-term reliability. Our approach is demonstrated through an
experimental case study in machine tending, comparing four programs with
diverse strategies using a UR5e robot. The proposed metrics directly compare
and categorize different robot programs, regardless of the specific task, by
linking code performance to its physical manifestation through power
consumption patterns. Our results reveal the strengths and weaknesses of each
strategy, offering actionable insights for optimizing robot programming
practices. Enhancing energy efficiency and reliability through this
embodiment-centric approach not only improves individual robot performance
but also
supports broader industrial objectives such as sustainable manufacturing and
cost reduction.
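Since the abstract does not define the coefficients themselves, the sketch
below shows only the generic step they build on: integrating a logged
electrical power trace into energy, which lets two robot programs be compared
on embodiment rather than CPU terms. The power profiles here are synthetic
placeholders, not measured UR5e data.

    import numpy as np

    def trace_energy(t, p):
        """Trapezoidal integral of power p [W] over timestamps t [s] -> joules."""
        return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(t)))

    # Hypothetical comparison of two programs performing the same tending task:
    t = np.linspace(0.0, 12.0, 1200)           # 100 Hz power logging
    p_a = 90 + 25 * np.abs(np.sin(0.8 * t))    # synthetic profile, program A [W]
    p_b = 90 + 40 * np.abs(np.sin(1.6 * t))    # synthetic profile, program B [W]
    print(trace_energy(t, p_a), trace_energy(t, p_b))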
☆ Real-Time 3D Vision-Language Embedding Mapping
A metric-accurate semantic 3D representation is essential for many robotic
tasks. This work proposes a simple yet powerful way to integrate the 2D
embeddings of a Vision-Language Model into a metric-accurate 3D representation
in real time. We combine a local embedding masking strategy, for a more distinct
embedding distribution, with a confidence-weighted 3D integration for more
reliable 3D embeddings. The resulting metric-accurate embedding representation
is task-agnostic and can represent semantic concepts at a global multi-room
scale as well as at a local object level. This enables a variety of interactive
robotic
applications that require the localisation of objects-of-interest via natural
language. We evaluate our approach on a variety of real-world sequences and
demonstrate that these strategies achieve a more accurate object-of-interest
localisation while improving the runtime performance in order to meet our
real-time constraints. We further demonstrate the versatility of our approach
in a variety of interactive handheld, mobile robotics and manipulation tasks,
requiring only raw image data.
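A minimal sketch of the confidence-weighted 3D integration, assuming a simple
voxel hash map and a per-observation confidence score; the class name, voxel
size, and embedding dimension are illustrative rather than the authors'
implementation. Each voxel keeps a confidence-weighted running mean of the 2D
embeddings projected into it.

    import numpy as np

    class EmbeddingVoxelMap:
        def __init__(self, voxel_size=0.05, dim=512):
            self.voxel_size, self.dim = voxel_size, dim
            self.mean, self.weight = {}, {}        # keyed by voxel index

        def integrate(self, point, embedding, confidence):
            key = tuple(np.floor(point / self.voxel_size).astype(int))
            w = self.weight.get(key, 0.0)
            m = self.mean.get(key, np.zeros(self.dim))
            # confidence-weighted running average: m <- (w*m + c*e) / (w + c)
            self.mean[key] = (w * m + confidence * embedding) / (w + confidence)
            self.weight[key] = w + confidence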
☆ Situationally-aware Path Planning Exploiting 3D Scene Graphs
Saad Ejaz, Marco Giberna, Muhammad Shaheer, Jose Andres Millan-Romera, Ali Tourani, Paul Kremer, Holger Voos, Jose Luis Sanchez-Lopez
3D Scene Graphs integrate both metric and semantic information, yet their
structure remains underutilized for improving path planning efficiency and
interpretability. In this work, we present S-Path, a situationally-aware path
planner that leverages the metric-semantic structure of indoor 3D Scene Graphs
to significantly enhance planning efficiency. S-Path follows a two-stage
process: it first performs a search over a semantic graph derived from the
scene graph to yield a human-understandable high-level path. This also
identifies relevant regions for planning, which later allows the decomposition
of the problem into smaller, independent subproblems that can be solved in
parallel. We also introduce a replanning mechanism that, in the event of an
infeasible path, reuses information from previously solved subproblems to
update semantic heuristics and prioritize reuse to further improve the
efficiency of future planning attempts. Extensive experiments on both
real-world and simulated environments show that S-Path achieves an average
5.7x reduction in planning time while maintaining comparable path
optimality to classical sampling-based planners and surpassing them in complex
scenarios, making it an efficient and interpretable path planner for
environments represented by indoor 3D Scene Graphs.
☆ Mitigating Undesired Conditions in Flexible Production with Product-Process-Resource Asset Knowledge Graphs
Contemporary industrial cyber-physical production systems (CPPS) composed of
robotic workcells face significant challenges in the analysis of undesired
conditions due to the flexibility of Industry 4.0, which disrupts traditional
quality assurance mechanisms. This paper presents a novel industry-oriented
semantic model called Product-Process-Resource Asset Knowledge Graph (PPR-AKG),
which is designed to analyze and mitigate undesired conditions in flexible
CPPS. Built on top of the well-proven Product-Process-Resource (PPR) model
originating from ISA-95 and VDI-3682, a comprehensive OWL ontology addresses
shortcomings of conventional model-driven engineering for CPPS, particularly
the inadequate representation of undesired conditions and error handling. The
integration of semantic technologies with large language models (LLMs) provides
intuitive interfaces for factory operators, production planners, and engineers
to interact with the entire model using natural language. Evaluation with the
use case addressing electric vehicle battery remanufacturing demonstrates that
the PPR-AKG approach efficiently supports resource allocation based on
explicitly represented capabilities as well as identification and mitigation of
undesired conditions in production. The key contributions include (1) a
holistic PPR-AKG model capturing multi-dimensional production knowledge, and
(2) the useful combination of the PPR-AKG with LLM-based chatbots for human
interaction.
comment: 3 pages, 1 figure
☆ EcBot: Data-Driven Energy Consumption Open-Source MATLAB Library for Manipulators
Existing literature proposes models for estimating the electrical power of
manipulators, yet two primary limitations prevail. First, most models are
predominantly tested using traditional industrial robots. Second, these models
often lack accuracy. To address these issues, we introduce an open-source
MATLAB-based library designed to automatically generate energy consumption
models for
manipulators. The necessary inputs for the library are Denavit-Hartenberg
parameters, link masses, and centers of mass. Additionally, our model is
data-driven and requires real operational data, including joint positions,
velocities, accelerations, electrical power, and corresponding timestamps. We
validated our methodology by testing on four lightweight robots sourced from
three distinct manufacturers: Universal Robots, Franka Emika, and Kinova. The
model underwent testing, and the results demonstrated an RMSE ranging from 1.42
W to 2.80 W for the training dataset and from 1.45 W to 5.25 W for the testing
dataset.
☆ ADPro: a Test-time Adaptive Diffusion Policy for Robot Manipulation via Manifold and Initial Noise Constraints
Diffusion policies have recently emerged as a powerful class of visuomotor
controllers for robot manipulation, offering stable training and expressive
multi-modal action modeling. However, existing approaches typically treat
action generation as an unconstrained denoising process, ignoring valuable a
priori knowledge about geometry and control structure. In this work, we propose
the Adaptive Diffusion Policy (ADP), a test-time adaptation method that
introduces two key inductive biases into the diffusion process. First, we embed a
geometric manifold constraint that aligns denoising updates with task-relevant
subspaces, leveraging the fact that the relative pose between the end-effector
and target scene provides a natural gradient direction, and guiding denoising
along the geodesic path of the manipulation manifold. Then, to reduce
unnecessary exploration and accelerate convergence, we propose an analytically
guided initialization: rather than sampling from an uninformative prior, we
compute a rough registration between the gripper and target scenes to propose a
structured initial noisy action. ADP is compatible with pre-trained diffusion
policies and requires no retraining, enabling test-time adaptation that tailors
the policy to specific tasks, thereby enhancing generalization across novel
tasks and environments. Experiments on RLBench, CALVIN, and a real-world
dataset show that ADPro, an implementation of ADP, improves success rates,
generalization, and sampling efficiency, achieving up to 25% faster execution
and a 9-percentage-point improvement over strong diffusion baselines.
☆ REBot: Reflexive Evasion Robot for Instantaneous Dynamic Obstacle Avoidance
Dynamic obstacle avoidance (DOA) is critical for quadrupedal robots operating
in environments with moving obstacles or humans. Existing approaches typically
rely on navigation-based trajectory replanning, which assumes sufficient
reaction time and therefore fails when obstacles approach rapidly. In such
scenarios, quadrupedal robots require reflexive evasion capabilities to perform
instantaneous, low-latency maneuvers. This paper introduces Reflexive Evasion
Robot (REBot), a control framework that enables quadrupedal robots to achieve
real-time reflexive obstacle avoidance. REBot integrates an avoidance policy
and a recovery policy within a finite-state machine. With carefully designed
learning curricula and by incorporating regularization and adaptive rewards,
REBot achieves robust evasion and rapid stabilization in instantaneous DOA
tasks. We validate REBot through extensive simulations and real-world
experiments, demonstrating notable improvements in avoidance success rates,
energy efficiency, and robustness to fast-moving obstacles. Videos and appendix
are available on https://rebot-2025.github.io/.
☆ Depth Jitter: Seeing through the Depth
Depth information is essential in computer vision, particularly in underwater
imaging, robotics, and autonomous navigation. However, conventional
augmentation techniques overlook depth-aware transformations, limiting model
robustness to real-world depth variations. In this paper, we introduce
Depth-Jitter, a novel depth-based augmentation technique that simulates natural
depth variations to improve generalization. Our approach applies adaptive depth
offsetting, guided by depth variance thresholds, to generate synthetic depth
perturbations while preserving structural integrity. We evaluate Depth-Jitter
on two benchmark datasets, FathomNet and UTDAC2020, demonstrating its impact on
model stability under diverse depth conditions. Extensive experiments compare
Depth-Jitter against traditional augmentation strategies such as ColorJitter,
analyzing performance across varying learning rates, encoders, and loss
functions. While Depth-Jitter does not always outperform conventional methods
in absolute performance, it consistently enhances model stability and
generalization in depth-sensitive environments. These findings highlight the
potential of depth-aware augmentation for real-world applications and provide a
foundation for further research into depth-based learning strategies. The
proposed technique is publicly available to support advancements in depth-aware
augmentation; the code is at https://github.com/mim-team/Depth-Jitter.
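An illustrative reimplementation of the core augmentation idea; the released
code above is authoritative, and the window size, threshold, and offset range
here are made-up parameters. A random depth offset is applied only where local
depth variance is low, perturbing smooth regions while preserving structural
boundaries.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def depth_jitter(depth, var_thresh=0.01, max_offset=0.2, win=7):
        """Offset depth only in low-variance regions of an (H, W) depth map."""
        local_mean = uniform_filter(depth, win)
        local_var = uniform_filter(depth ** 2, win) - local_mean ** 2
        offset = np.random.uniform(-max_offset, max_offset)
        jittered = depth.copy()
        jittered[local_var < var_thresh] += offset  # spare structure boundaries
        return jittered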
☆ Computer Vision-based Adaptive Control for Back Exoskeleton Performance Optimization
Andrea Dal Prete, Seyram Ofori, Chan Yon Sin, Ashwin Narayan, Francesco Braghin, Marta Gandolla, Haoyong Yu
Back exoskeletons can reduce musculoskeletal strain, but their effectiveness
depends on support modulation and adaptive control. This study addresses two
challenges: defining optimal support strategies and developing adaptive control
based on payload estimation. We introduce an optimization space based on muscle
activity reduction, perceived discomfort, and user preference, constructing
functions to identify optimal strategies. Experiments with 12 subjects revealed
optimal operating regions, highlighting the need for dynamic modulation. Based
on these insights, we developed a vision-based adaptive control pipeline that
estimates payloads in real-time by enhancing exoskeleton contextual
understanding, minimising latency and enabling support adaptation within the
defined optimisation space. Validation with 12 more subjects showed over 80%
accuracy and improvements across all metrics. Compared to static control,
adaptive modulation reduced peak back muscle activation by up to 23% while
preserving user preference and minimising discomfort. These findings validate
the proposed framework and highlight the potential of intelligent,
context-aware control in industrial exoskeletons.
☆ Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model
Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma
Affordance grounding focuses on predicting the specific regions of objects
that are associated with the actions to be performed by robots. It plays a
vital role in the fields of human-robot interaction, human-object interaction,
embodied manipulation, and embodied perception. Existing models often neglect
the affordance shared among different objects because they lack the
Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD)
generalization and explicit reasoning capabilities. To address these
challenges, we propose Affordance-R1, the first unified affordance grounding
framework that integrates cognitive CoT guided Group Relative Policy
Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we
designed a sophisticated affordance reward function, which combines format,
perception, and cognition rewards to effectively guide optimization.
Furthermore, we constructed a high-quality affordance-centric reasoning
dataset, ReasonAff, to support training. Trained exclusively via reinforcement
learning with GRPO and without explicit reasoning data, Affordance-R1 achieves
robust zero-shot generalization and exhibits emergent test-time reasoning
capabilities. Comprehensive experiments demonstrate that our model outperforms
well-established methods and exhibits open-world generalization. To the best of
our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with
reasoning into affordance reasoning. The code of our method and our dataset are
released on https://github.com/hq-King/Affordance-R1.
☆ Beyond Constant Parameters: Hyper Prediction Models and HyperMPC
Model Predictive Control (MPC) is among the most widely adopted and reliable
methods for robot control, relying critically on an accurate dynamics model.
However, existing dynamics models used in the gradient-based MPC are limited by
computational complexity and state representation. To address this limitation,
we propose the Hyper Prediction Model (HyperPM), a novel approach in which we
project the unmodeled dynamics onto a time-dependent dynamics model. This
time-dependency is captured through time-varying model parameters, whose
evolution over the MPC prediction horizon is learned using a neural network.
Such formulation preserves the computational efficiency and robustness of the
base model while equipping it with the capacity to anticipate previously
unmodeled phenomena. We evaluated the proposed approach on several challenging
systems, including real-world F1TENTH autonomous racing, and demonstrated that
it significantly reduces long-horizon prediction errors. Moreover, when
integrated within the MPC framework (HyperMPC), our method consistently
outperforms existing state-of-the-art techniques.
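A minimal sketch of the formulation as described in the abstract, with an
assumed network architecture and an assumed base-dynamics interface: a network
maps the current state to one parameter vector per step of the prediction
horizon, and the base model is rolled out with these time-varying parameters.

    import torch
    import torch.nn as nn

    class HyperParamNet(nn.Module):
        def __init__(self, state_dim=6, horizon=20, n_params=3):
            super().__init__()
            self.horizon, self.n_params = horizon, n_params
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
                nn.Linear(128, horizon * n_params))

        def forward(self, state):                 # -> (horizon, n_params)
            return self.net(state).view(self.horizon, self.n_params)

    def rollout(x0, controls, theta_seq, base_step):
        """Propagate an assumed base model with time-varying parameters."""
        x, traj = x0, []
        for u, theta in zip(controls, theta_seq):
            x = base_step(x, u, theta)            # e.g. a kinematic car step
            traj.append(x)
        return torch.stack(traj)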
☆ Graph-based Robot Localization Using a Graph Neural Network with a Floor Camera and a Feature Rich Industrial Floor
Accurate localization represents a fundamental challenge in
robotic navigation. Traditional methodologies, such as Lidar or QR-code based
systems, suffer from inherent scalability and adaptability constraints,
particularly in complex environments. In this work, we propose
an innovative localization framework that harnesses flooring characteristics
by employing graph-based representations and Graph Convolutional
Networks (GCNs). Our method uses graphs to represent floor features,
which helps localize the robot more accurately (0.64 cm error) and more
efficiently than comparing individual image features. Additionally, this
approach successfully addresses the kidnapped robot problem in every
frame without requiring complex filtering processes. These advancements
open up new possibilities for robotic navigation in diverse environments.
comment: Accepted at 28th RoboCup International Symposium, Salvador, Brazil
☆ GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving
Diffusion-based models are redefining the state-of-the-art in end-to-end
autonomous driving, yet their performance is increasingly hampered by a
reliance on transformer-based fusion. These architectures face fundamental
limitations: quadratic computational complexity restricts the use of
high-resolution features, and a lack of spatial priors prevents them from
effectively modeling the inherent structure of Bird's Eye View (BEV)
representations. This paper introduces GMF-Drive (Gated Mamba Fusion for
Driving), an end-to-end framework that overcomes these challenges through two
principled innovations. First, we supersede the information-limited
histogram-based LiDAR representation with a geometrically-augmented pillar
format encoding shape descriptors and statistical features, preserving critical
3D geometric details. Second, we propose a novel hierarchical gated mamba
fusion (GM-Fusion) architecture that substitutes an expensive transformer with
a highly efficient, spatially-aware state-space model (SSM). Our core BEV-SSM
leverages directional sequencing and adaptive fusion mechanisms to capture
long-range dependencies with linear complexity, while explicitly respecting the
unique spatial properties of the driving scene. Extensive experiments on the
challenging NAVSIM benchmark demonstrate that GMF-Drive achieves a new
state-of-the-art performance, significantly outperforming DiffusionDrive.
Comprehensive ablation studies validate the efficacy of each component,
demonstrating that task-specific SSMs can surpass a general-purpose transformer
in both performance and efficiency for autonomous driving.
comment: 7 pages, 4 figures
☆ Bounding Distributional Shifts in World Modeling through Novelty Detection
Recent work on visual world models shows significant promise in learning
latent state dynamics from pre-trained image backbones. However, most of the
current approaches are sensitive to training quality, requiring near-complete
coverage of the action and state space during training to prevent divergence
during inference. To make a model-based planning algorithm more robust to the
quality of the learned world model, we propose in this work to use a
variational autoencoder as a novelty detector to ensure that proposed action
trajectories during planning do not cause the learned model to deviate from the
training data distribution. To evaluate the effectiveness of this approach, a
series of experiments in challenging simulated robot environments was carried
out, with the proposed method incorporated into a model-predictive control
policy loop extending the DINO-WM architecture. The results clearly show that
the proposed method improves over state-of-the-art solutions in terms of data
efficiency.
comment: 7 pages, 6 figures
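A hedged sketch of the gating mechanism, assuming a latent world model with a
rollout method and a VAE that returns reconstructions; both interfaces and the
threshold are illustrative. Candidate action trajectories whose predicted
latent states reconstruct poorly, i.e. look novel to the VAE, are discarded
before the model-predictive controller ranks the remainder.

    import torch

    def filter_candidates(candidates, world_model, vae, z0, threshold):
        kept = []
        for actions in candidates:
            zs = world_model.rollout(z0, actions)    # assumed latent rollout
            recon, _, _ = vae(zs)                    # assumed (recon, mu, logvar)
            novelty = torch.mean((recon - zs) ** 2)  # reconstruction error
            if novelty < threshold:                  # in-distribution enough
                kept.append(actions)
        return kept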
☆ Incremental Language Understanding for Online Motion Planning of Robot Manipulators IROS 2025
Human-robot interaction requires robots to process language incrementally,
adapting their actions in real-time based on evolving speech input. Existing
approaches to language-guided robot motion planning typically assume fully
specified instructions, resulting in inefficient stop-and-replan behavior when
corrections or clarifications occur. In this paper, we introduce a novel
reasoning-based incremental parser which integrates an online motion planning
algorithm within the cognitive architecture. Our approach enables continuous
adaptation to dynamic linguistic input, allowing robots to update motion plans
without restarting execution. The incremental parser maintains multiple
candidate parses, leveraging reasoning mechanisms to resolve ambiguities and
revise interpretations when needed. By combining symbolic reasoning with online
motion planning, our system achieves greater flexibility in handling speech
corrections and dynamically changing constraints. We evaluate our framework in
real-world human-robot interaction scenarios, demonstrating online adaptations of
goal poses, constraints, or task objectives. Our results highlight the
advantages of integrating incremental language understanding with real-time
motion planning for natural and fluid human-robot collaboration. The
experiments are demonstrated in the accompanying video at
www.acin.tuwien.ac.at/42d5.
comment: 8 pages, 9 figures, accepted at IROS 2025
★ ME$^3$-BEV: Mamba-Enhanced Deep Reinforcement Learning for End-to-End Autonomous Driving with BEV-Perception
Autonomous driving systems face significant challenges in perceiving complex
environments and making real-time decisions. Traditional modular approaches,
while offering interpretability, suffer from error propagation and coordination
issues, whereas end-to-end learning systems can simplify the design but face
computational bottlenecks. This paper presents a novel approach to autonomous
driving using deep reinforcement learning (DRL) that integrates bird's-eye view
(BEV) perception for enhanced real-time decision-making. We introduce the
\texttt{Mamba-BEV} model, an efficient spatio-temporal feature extraction
network that combines BEV-based perception with the Mamba framework for
temporal feature modeling. This integration allows the system to encode vehicle
surroundings and road features in a unified coordinate system and accurately
model long-range dependencies. Building on this, we propose the
\texttt{ME$^3$-BEV} framework, which utilizes the \texttt{Mamba-BEV} model as a
feature input for end-to-end DRL, achieving superior performance in dynamic
urban driving scenarios. We further enhance the interpretability of the model
by visualizing high-dimensional features through semantic segmentation,
providing insight into the learned representations. Extensive experiments on
the CARLA simulator demonstrate that \texttt{ME$^3$-BEV} outperforms existing
models across multiple metrics, including collision rate and trajectory
accuracy, offering a promising solution for real-time autonomous driving.
☆ ReNiL: Relative Neural Inertial Locator with Any-Scale Bayesian Inference
Pedestrian inertial localization is key for mobile and IoT services because
it provides infrastructure-free positioning. Yet most learning-based methods
depend on fixed sliding-window integration, struggle to adapt to diverse motion
scales and cadences, and yield inconsistent uncertainty, limiting real-world
use. We present ReNiL, a Bayesian deep-learning framework for accurate,
efficient, and uncertainty-aware pedestrian localization. ReNiL introduces
Inertial Positioning Demand Points (IPDPs) to estimate motion at contextually
meaningful waypoints instead of dense tracking, and supports inference on IMU
sequences at any scale so cadence can match application needs. It couples a
motion-aware orientation filter with an Any-Scale Laplace Estimator (ASLE), a
dual-task network that blends patch-based self-supervision with Bayesian
regression. By modeling displacements with a Laplace distribution, ReNiL
provides homogeneous Euclidean uncertainty that integrates cleanly with other
sensors. A Bayesian inference chain links successive IPDPs into consistent
trajectories. On RoNIN-ds and a new WUDataset covering indoor and outdoor
motion from 28 participants, ReNiL achieves state-of-the-art displacement
accuracy and uncertainty consistency, outperforming TLIO, CTIN, iMoT, and RoNIN
variants while reducing computation. Application studies further show
robustness and practicality for mobile and IoT localization, making ReNiL a
scalable, uncertainty-aware foundation for next-generation positioning.
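For the Laplace displacement model, a worked sketch of the corresponding
training loss; the network and its output parameterisation are assumptions.
The regressor predicts a location mu and a log-scale log b per displacement
and is trained with the Laplace negative log-likelihood, so the scale b
directly expresses the homogeneous uncertainty mentioned above.

    import math
    import torch

    def laplace_nll(mu, log_b, target):
        """NLL of Laplace(mu, b): log(2b) + |x - mu| / b, batch-averaged."""
        return (log_b + math.log(2.0)
                + (target - mu).abs() / log_b.exp()).mean()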
☆ PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation ICCV 2025
The fragmentation between high-level task semantics and low-level geometric
features remains a persistent challenge in robotic manipulation. While
vision-language models (VLMs) have shown promise in generating affordance-aware
visual representations, the lack of semantic grounding in canonical spaces and
reliance on manual annotations severely limit their ability to capture dynamic
semantic-affordance relationships. To address these issues, we propose Primitive-Aware
Semantic Grounding (PASG), a closed-loop framework that introduces: (1)
Automatic primitive extraction through geometric feature aggregation, enabling
cross-category detection of keypoints and axes; (2) VLM-driven semantic
anchoring that dynamically couples geometric primitives with functional
affordances and task-relevant description; (3) A spatial-semantic reasoning
benchmark and a fine-tuned VLM (Qwen2.5VL-PA). We demonstrate PASG's
effectiveness in practical robotic manipulation tasks across diverse scenarios,
achieving performance comparable to manual annotations. PASG achieves a
finer-grained semantic-affordance understanding of objects, establishing a
unified paradigm for bridging geometric primitives with task semantics in
robotic manipulation.
comment: Accepted to ICCV 2025. 8 pages main paper, 8 figures, plus
supplementary material
☆ Dynamical Trajectory Planning of Disturbance Consciousness for Air-Land Bimodal Unmanned Aerial Vehicles
Air-land bimodal vehicles provide a promising solution for navigating complex
environments by combining the flexibility of aerial locomotion with the energy
efficiency of ground mobility. To enhance the robustness of trajectory planning
under environmental disturbances, this paper presents a disturbance-aware
planning framework that incorporates real-time disturbance estimation into both
path searching and trajectory optimization. A key component of the framework is
a disturbance-adaptive safety boundary adjustment mechanism, which dynamically
modifies the vehicle's feasible dynamic boundaries based on estimated
disturbances to ensure trajectory feasibility. Leveraging the dynamics model of
the bimodal vehicle, the proposed approach achieves adaptive and reliable
motion planning across different terrains and operating conditions. A series of
real-world experiments and benchmark comparisons on a custom-built platform
validate the effectiveness and robustness of the method, demonstrating
improvements in tracking accuracy, task efficiency, and energy performance
under both ground and aerial disturbances.
☆ Social and Telepresence Robots for Accessibility and Inclusion in Small Museums
Nello Balossino, Rossana Damiano, Cristina Gena, Alberto Lillo, Anna Maria Marras, Claudio Mattutino, Antonio Pizzo, Alessia Prin, Fabiana Vernero
There are still many museums that present accessibility barriers,
particularly regarding perceptual, cultural, and cognitive aspects. This is
especially evident in low-density population areas. The aim of the ROBSO-PM
project is to improve the accessibility of small museums through the use of
social robots and social telepresence robots, focusing on three museums as case
studies: the Museum of the Holy Shroud in Turin, a small but globally known
institution, and two lesser known mountain museums: the Museum of the Champlas
du Col Carnival and the Pragelato Museum of Alpine Peoples' Costumes and
Traditions. The project explores two main applications for robots: as guides
supporting inclusive visits for foreign or disabled visitors, and as
telepresence tools allowing people with limited mobility to access museums
remotely. From a research perspective, key topics include storytelling, robot
personality, empathy, personalization, and, in the case of telepresence,
collaboration between the robot and the person, with clearly defined roles and
autonomy.
☆ Latent Policy Barrier: Learning Robust Visuomotor Policies by Staying In-Distribution
Visuomotor policies trained via behavior cloning are vulnerable to covariate
shift, where small deviations from expert trajectories can compound into
failure. Common strategies to mitigate this issue involve expanding the
training distribution through human-in-the-loop corrections or synthetic data
augmentation. However, these approaches are often labor-intensive, rely on
strong task assumptions, or compromise the quality of imitation. We introduce
Latent Policy Barrier (LPB), a framework for robust visuomotor policy learning.
Inspired by Control Barrier Functions, LPB treats the latent embeddings of
expert demonstrations as an implicit barrier separating safe, in-distribution
states from unsafe, out-of-distribution (OOD) ones. Our approach decouples the
role of precise expert imitation and OOD recovery into two separate modules: a
base diffusion policy trained solely on expert data, and a dynamics model trained on
both expert and suboptimal policy rollout data. At inference time, the dynamics
model predicts future latent states and optimizes them to stay within the
expert distribution. Both simulated and real-world experiments show that LPB
improves both policy robustness and data efficiency, enabling reliable
manipulation from limited expert data and without additional human correction
or annotation.
☆ Affordance-Guided Dual-Armed Disassembly Teleoperation for Mating Parts
Robotic non-destructive disassembly of mating parts remains challenging due
to the need for flexible manipulation and the limited visibility of internal
structures. This study presents an affordance-guided teleoperation system that
enables intuitive human demonstrations for dual-arm fix-and-disassemble tasks
for mating parts. The system visualizes feasible grasp poses and disassembly
directions in a virtual environment, both derived from the object's geometry,
to address occlusions and structural complexity. To prevent excessive position
tracking under load when following the affordance, we integrate a hybrid
controller that combines position and impedance control into the teleoperated
disassembly arm. Real-world experiments validate the effectiveness of the
proposed system, showing improved task success rates and reduced object pose
deviation.
comment: 6 pages, 9 figures
☆ Modular Vacuum-Based Fixturing System for Adaptive Disassembly Workspace Integration
The disassembly of small household appliances poses significant challenges
due to their complex and curved geometries, which render traditional rigid
fixtures inadequate. In this paper, we propose a modular vacuum-based fixturing
system that leverages commercially available balloon-type soft grippers to
conform to arbitrarily shaped surfaces and provide stable support during
screw-removal tasks. To enable a reliable deployment of the system, we develop
a stability-aware planning framework that samples the bottom surface of the
target object, filters candidate contact points based on geometric continuity,
and evaluates support configurations using convex hull-based static stability
criteria. We compare the quality of object placement under different numbers
and configurations of balloon grippers. In addition, real-world experiments were
conducted to compare the success rates of traditional rigid fixtures with our
proposed system. The results demonstrate that our method consistently achieves
higher success rates and superior placement stability during screw removal
tasks.
comment: 8 pages, 9 figures
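The generic geometric core of such a test, assuming the paper's criteria
contain at minimum a support-polygon check (any stability margins or force
limits they add are not shown): a placement is treated as statically stable if
the centre-of-mass projection lies inside the convex hull of the contact
points.

    import numpy as np
    from scipy.spatial import Delaunay

    def is_statically_stable(contacts_xy, com_xy):
        """contacts_xy: (N, 2) support contacts; com_xy: (2,) CoM projection."""
        if len(contacts_xy) < 3:                   # no 2D support polygon
            return False
        hull = Delaunay(contacts_xy)
        return bool(hull.find_simplex(np.atleast_2d(com_xy))[0] >= 0)

    print(is_statically_stable(np.array([[0., 0.], [1., 0.], [.5, 1.]]),
                               np.array([0.5, 0.4])))  # True: CoM inside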
♻ ☆ LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
Predictive manipulation has recently gained considerable attention in the
Embodied AI community due to its potential to improve robot policy performance
by leveraging predicted states. However, generating accurate future visual
states of robot-object interactions from world models remains a well-known
challenge, particularly in achieving high-quality pixel-level representations.
To this end, we propose LaDi-WM, a world model that predicts the latent space
of future states using diffusion modeling. Specifically, LaDi-WM leverages the
well-established latent space aligned with pre-trained Visual Foundation Models
(VFMs), which comprises both geometric features (DINO-based) and semantic
features (CLIP-based). We find that predicting the evolution of the latent
space is easier to learn and more generalizable than directly predicting
pixel-level images. Building on LaDi-WM, we design a diffusion policy that
iteratively refines output actions by incorporating forecasted states, thereby
generating more consistent and accurate results. Extensive experiments on both
synthetic and real-world benchmarks demonstrate that LaDi-WM significantly
enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% in
the real-world scenario. Furthermore, our world model and policies achieve
impressive generalizability in real-world experiments.
comment: CoRL 2025
♻ ☆ An improved two-dimensional time-to-collision for articulated vehicles: predicting sideswipe and rear-end collisions
Time-to-collision (TTC) is a widely used measure for predicting rear-end
collisions, assuming constant speed and heading for both vehicles in the
prediction horizon. However, this conventional formulation cannot detect
sideswipe collisions. A two-dimensional extension, $\text{TTC}_{\text{2D}}$,
has been proposed in the literature to address lateral interactions. However,
this formulation assumes both vehicles have the same heading and that their
headings remain unchanged during the manoeuvre, in addition to the constant
speed and heading assumptions in the prediction horizon. Moreover, its use for
articulated vehicles like a tractor-semitrailer remains unclear. This paper
proposes three enhanced versions of $\text{TTC}_{\text{2D}}$ to overcome these
limitations. The first incorporates the vehicle heading to account for
directional differences. The standard assumption of constant speed and heading
in the prediction horizon holds. The second adapts the formulation for
articulated vehicles, and the third allows for constant acceleration, relaxing
the constant speed assumption in the prediction horizon. All versions are
evaluated in simulated cut-in scenarios, covering both sideswipe and rear-end
collisions, using the CARLA simulation environment with a tractor-semitrailer
model. Results show that the proposed versions predict sideswipe collisions
with better accuracy than the existing $\text{TTC}_{\text{2D}}$. They also
detect rear-end collisions similarly to the existing methods.
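For reference, the basic two-dimensional computation that these formulations
extend, under the constant speed and heading assumption and with both vehicles
approximated as discs; this is a simplification of the paper's heading-aware,
articulated, and constant-acceleration variants. TTC is the smallest
non-negative root of |r + v t| = R for relative position r, relative velocity
v, and combined radius R.

    import numpy as np

    def ttc_2d(p1, v1, p2, v2, r_sum):
        """Smallest t >= 0 with |(p2 - p1) + (v2 - v1) t| = r_sum, else inf."""
        r, v = p2 - p1, v2 - v1
        a, b, c = v @ v, 2.0 * (r @ v), r @ r - r_sum ** 2
        disc = b ** 2 - 4 * a * c
        if a == 0.0 or disc < 0.0:
            return np.inf                    # parallel motion or no contact
        roots = (-b + np.array([-1.0, 1.0]) * np.sqrt(disc)) / (2 * a)
        roots = roots[roots >= 0.0]
        return roots.min() if roots.size else np.inf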
♻ ☆ Failure-Aware Multi-Robot Coordination for Resilient and Adaptive Target Tracking
Multi-robot coordination is crucial for autonomous systems, yet real-world
deployments often encounter various failures. These include both temporary and
permanent disruptions in sensing and communication, which can significantly
degrade system robustness and performance if not explicitly modeled. Despite
its practical importance, failure-aware coordination remains underexplored in
the literature. To bridge the gap between idealized conditions and the
complexities of real-world environments, we propose a unified failure-aware
coordination framework designed to enable resilient and adaptive multi-robot
target tracking under both temporary and permanent failure conditions. Our
approach systematically distinguishes between two classes of failures: (1)
probabilistic and temporary disruptions, where robots recover from intermittent
sensing or communication losses by dynamically adapting paths and avoiding
inferred danger zones, and (2) permanent failures, where robots lose sensing or
communication capabilities irreversibly, requiring sustained, decentralized
behavioral adaptation. To handle these scenarios, the robot team is partitioned
into subgroups. Robots that remain connected form a communication group and
collaboratively plan using partially centralized nonlinear optimization. Robots
experiencing permanent disconnection or failure continue to operate
independently through decentralized or individual optimization, allowing them
to contribute to the task within their local context. We extensively evaluate
our method across a range of benchmark variations and conduct a comprehensive
assessment under diverse real-world failure scenarios. Results show that our
framework consistently achieves robust performance in realistic environments
with unknown danger zones, offering a practical and generalizable solution for
the multi-robot systems community.
♻ ☆ Would you let a humanoid play storytelling with your child? A usability study on LLM-powered narrative Human-Robot Interaction
Maria Lombardi, Carmela Calabrese, Davide Ghiglino, Caterina Foglino, Davide De Tommaso, Giulia Da Lisca, Lorenzo Natale, Agnieszka Wykowska
A key challenge in human-robot interaction research lies in developing
robotic systems that can effectively perceive and interpret social cues,
facilitating natural and adaptive interactions. In this work, we present a
novel framework for enhancing the attention of the iCub humanoid robot by
integrating advanced perceptual abilities to recognise social cues, understand
surroundings through generative models, such as ChatGPT, and respond with
contextually appropriate social behaviour. Specifically, we propose an
interaction task implementing a narrative protocol (storytelling task) in which
the human and the robot create a short imaginary story together, exchanging in
turn cubes with creative images placed on them. To validate the protocol and
the framework, experiments were performed to quantify the degree of usability
and the quality of experience perceived by participants interacting with the
system. Such a system can be beneficial in promoting effective human-robot
collaboration, especially in assistance, education, and rehabilitation
scenarios where social awareness and robot responsiveness play a pivotal role.
♻ ☆ Learning to Initialize Trajectory Optimization for Vision-Based Autonomous Flight in Unknown Environments IROS 2025
Autonomous flight in unknown environments requires precise spatial and
temporal trajectory planning, often involving computationally expensive
nonconvex optimization prone to local optima. To overcome these challenges, we
present the Neural-Enhanced Trajectory Planner (NEO-Planner), a novel approach
that leverages a Neural Network (NN) Planner to provide informed initial values
for trajectory optimization. The NN-Planner is trained on a dataset generated
by an expert planner using batch sampling, capturing multimodal trajectory
solutions. It learns to predict spatial and temporal parameters for
trajectories directly from raw sensor observations. NEO-Planner starts
optimization from these predictions, accelerating computation speed while
maintaining explainability. Furthermore, we introduce a robust online
replanning framework that accommodates planning latency for smooth trajectory
tracking. Extensive simulations demonstrate that NEO-Planner reduces
optimization iterations by 20%, leading to a 26% decrease in computation time
compared with pure optimization-based methods. It maintains trajectory quality
comparable to baseline approaches and generalizes well to unseen environments.
Real-world experiments validate its effectiveness for autonomous drone
navigation in cluttered, unknown environments.
comment: Accepted to IROS 2025. Source code available
♻ ☆ Unified Multi-Rate Model Predictive Control for a Jet-Powered Humanoid Robot RAS 24
We propose a novel Model Predictive Control (MPC) framework for a jet-powered
flying humanoid robot. The controller is based on a linearised centroidal
momentum model to represent the flight dynamics, augmented with a second-order
nonlinear model to explicitly account for the slow and nonlinear dynamics of
jet propulsion. A key contribution is the introduction of a multi-rate MPC
formulation that handles the different actuation rates of the robot's joints
and jet engines while embedding the jet dynamics directly into the predictive
model. We validated the framework using the jet-powered humanoid robot iRonCub,
performing simulations in MuJoCo; the simulation results demonstrate the
robot's ability to recover from external disturbances and perform stable,
non-abrupt flight manoeuvres, validating the effectiveness of the proposed
approach.
comment: This paper has been accepted for publication at the 2025 IEEE-RAS
24th International Conference on Humanoid Robots (Humanoids), Seoul, 2025
♻ ☆ RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction ICCV 2025
In language-guided visual navigation, agents locate target objects in unseen
environments using natural language instructions. For reliable navigation in
unfamiliar scenes, agents should possess strong perception, planning, and
prediction capabilities. Additionally, when agents revisit previously explored
areas during long-term navigation, they may retain irrelevant and redundant
historical perceptions, leading to suboptimal results. In this work, we propose
RoboTron-Nav, a unified framework that integrates perception, planning, and
prediction capabilities through multitask collaborations on navigation and
embodied question answering tasks, thereby enhancing navigation performances.
Furthermore, RoboTron-Nav employs an adaptive 3D-aware history sampling
strategy to effectively and efficiently utilize historical observations. By
leveraging large language model, RoboTron-Nav comprehends diverse commands and
complex visual scenes, resulting in appropriate navigation actions.
RoboTron-Nav achieves an 81.1% success rate in object goal navigation on the
$\mathrm{CHORES}$-$\mathbb{S}$ benchmark, setting a new state-of-the-art
performance. Project page: https://yvfengzhong.github.io/RoboTron-Nav
comment: ICCV 2025
♻ ☆ MBA-SLAM: Motion Blur Aware Gaussian Splatting SLAM
Emerging 3D scene representations, such as Neural Radiance Fields (NeRF) and
3D Gaussian Splatting (3DGS), have demonstrated their effectiveness in
Simultaneous Localization and Mapping (SLAM) for photo-realistic rendering,
particularly when using high-quality video sequences as input. However,
existing methods struggle with motion-blurred frames, which are common in
real-world scenarios like low-light or long-exposure conditions. This often
results in a significant reduction in both camera localization accuracy and map
reconstruction quality. To address this challenge, we propose a dense visual
deblur SLAM pipeline (i.e. MBA-SLAM) to handle severe motion-blurred inputs and
enhance image deblurring. Our approach integrates an efficient motion
blur-aware tracker with either neural radiance fields or Gaussian Splatting
based mapper. By accurately modeling the physical image formation process of
motion-blurred images, our method simultaneously learns 3D scene representation
and estimates the cameras' local trajectory during exposure time, enabling
proactive compensation for motion blur caused by camera movement. In our
experiments, we demonstrate that MBA-SLAM surpasses previous state-of-the-art
methods in both camera localization and map reconstruction, showcasing superior
performance across a range of datasets, including synthetic and real datasets
featuring sharp images as well as those affected by motion blur, highlighting
the versatility and robustness of our approach. Code is available at
https://github.com/WU-CVGL/MBA-SLAM.
comment: Accepted to TPAMI; Deblur Gaussian Splatting SLAM
♻ ☆ CARE: Enhancing Safety of Visual Navigation through Collision Avoidance via Repulsive Estimation
We propose CARE (Collision Avoidance via Repulsive Estimation) to improve the
robustness of learning-based visual navigation methods. Recently, visual
navigation models, particularly foundation models, have demonstrated promising
performance by generating viable trajectories using only RGB images. However,
these policies can generalize poorly to environments containing
out-of-distribution (OOD) scenes characterized by unseen objects or different
camera setups (e.g., variations in field of view, camera pose, or focal
length). Without fine-tuning, such models could produce trajectories that lead
to collisions, necessitating substantial efforts in data collection and
additional training. To address this limitation, we introduce CARE, an
attachable module that enhances the safety of visual navigation without
requiring additional range sensors or fine-tuning of pretrained models. CARE
can be integrated seamlessly into any RGB-based navigation model that generates
local robot trajectories. It dynamically adjusts trajectories produced by a
pretrained model using repulsive force vectors computed from depth images
estimated directly from RGB inputs. We evaluate CARE by integrating it with
state-of-the-art visual navigation models across diverse robot platforms.
Real-world experiments show that CARE significantly reduces collisions (up to
100%) without compromising navigation performance in goal-conditioned
navigation, and further improves collision-free travel distance (up to 10.7x)
in exploration tasks. Project page: https://airlab-sogang.github.io/CARE/
comment: 16 pages, 6 figures
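A hedged sketch of the repulsive-estimation step in the style of artificial
potential fields; the gains, influence radius, and 2D simplification are
assumptions, and CARE itself derives its obstacle points from depth estimated
out of RGB. Obstacle points contribute inverse-square repulsive terms that
nudge each local waypoint away from nearby obstacles.

    import numpy as np

    def repulsive_vector(obstacles_xy, robot_xy, influence=2.0, gain=0.5):
        """Sum of 1/d^2-weighted unit vectors pointing away from obstacles."""
        force = np.zeros(2)
        for p in obstacles_xy:
            d = np.linalg.norm(p - robot_xy)
            if 1e-3 < d < influence:
                force += gain * (robot_xy - p) / d ** 3  # unit vector / d^2
        return force

    def adjust_trajectory(waypoints, obstacles_xy, step=0.1):
        return np.array([w + step * repulsive_vector(obstacles_xy, w)
                         for w in waypoints])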
♻ ☆ GhostShell: Streaming LLM Function Calls for Concurrent Embodied Programming
Jian Gong, Youwei Huang, Bo Yuan, Ming Zhu, Zhou Liao, Jianhang Liang, Juncheng Zhan, Jinke Wang, Hang Shu, Mingyue Xiong, Yanjun Ye, Yufan Zu, Yang Zhou, Yihan Ding, Xuannian Chen, Xingyu Lu, Runjie Ban, Bingchao Huang, Fusen Liu
We present GhostShell, a novel approach that leverages Large Language Models
(LLMs) to enable streaming and concurrent behavioral programming for embodied
systems. In contrast to conventional methods that rely on pre-scheduled action
sequences or behavior trees, GhostShell drives embodied systems to act
on-the-fly by issuing function calls incrementally as tokens are streamed from
the LLM. GhostShell features a streaming XML function token parser, a dynamic
function interface mapper, and a multi-channel scheduler that orchestrates
intra-channel synchronous and inter-channel asynchronous function calls,
thereby coordinating serial-parallel embodied actions across multiple robotic
components under LLM guidance. We evaluate GhostShell on our robotic prototype
COCO through comprehensive grounded experiments across 34 real-world
interaction tasks and multiple LLM backends. The results demonstrate that our
approach achieves a state-of-the-art Behavioral Correctness Metric of 0.85 with
Claude-4-Sonnet, and up to 66x faster response times compared to native LLM
function calling APIs. GhostShell also proves effective in long-horizon
multimodal tasks, exhibiting strong robustness and generalization capabilities.
comment: 17 pages, 5 figures, conference
♻ ☆ CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction ICRA 2025
Suhwan Choi, Yongjun Cho, Minchan Kim, Jaeyoon Jung, Myunchul Joe, Yubeen Park, Minseo Kim, Sungwoong Kim, Sungjae Lee, Hwiseong Park, Jiwan Chung, Youngjae Yu
Real-life robot navigation involves more than just reaching a destination; it
requires optimizing movements while addressing scenario-specific goals. An
intuitive way for humans to express these goals is through abstract cues like
verbal commands or rough sketches. Such human guidance may lack details or be
noisy. Nonetheless, we expect robots to navigate as intended. For robots to
interpret and execute these abstract instructions in line with human
expectations, they must share a common understanding of basic navigation
concepts with humans. To this end, we introduce CANVAS, a novel framework that
combines visual and linguistic instructions for commonsense-aware navigation.
Its success is driven by imitation learning, enabling the robot to learn from
human navigation behavior. We present COMMAND, a comprehensive dataset with
human-annotated navigation results, spanning over 48 hours and 219 km, designed
to train commonsense-aware navigation systems in simulated environments. Our
experiments show that CANVAS outperforms the strong rule-based system ROS
NavStack across all environments, demonstrating superior performance with noisy
instructions. Notably, in the orchard environment, where ROS NavStack records a
0% total success rate, CANVAS achieves a total success rate of 67%. CANVAS also
closely aligns with human demonstrations and commonsense constraints, even in
unseen environments. Furthermore, real-world deployment of CANVAS showcases
impressive Sim2Real transfer with a total success rate of 69%, highlighting the
potential of learning from human demonstrations in simulated environments for
real-world applications.
comment: Accepted to ICRA 2025, project page https://worv-ai.github.io/canvas
♻ ☆ Direct Robot Configuration Space Construction using Convolutional Encoder-Decoders ICML 2025
Intelligent robots must be able to perform safe and efficient motion planning
in their environments. Central to modern motion planning is the configuration
space. Configuration spaces define the set of configurations of a robot that
result in collisions with obstacles in the workspace, $\text{C}_{\text{clsn}}$,
and the set of configurations that do not, $\text{C}_{\text{free}}$. Modern
approaches to motion planning first compute the configuration space and then
perform motion planning using the calculated configuration space. Real-time
motion planning requires accurate and efficient construction of configuration
spaces.
We are the first to apply a convolutional encoder-decoder framework for
calculating highly accurate approximations to configuration spaces, essentially
learning how the robot and physical world interact. Our model achieves an
average 97.5% F1-score for predicting $\text{C}_{\text{free}}$ and
$\text{C}_{\text{clsn}}$ for 2-D robotic workspaces with a dual-arm robot. Our
method limits undetected collisions to less than 2.5% on robotic workspaces
that involve translation, rotation, and removal of obstacles. Our model learns
highly transferable features between robotic workspaces, requiring little to no
fine-tuning to adapt to new transformations of obstacles in the workspace.
comment: 8 pages, 7 figures, 4 tables; Appeared at the ICML 2025 Workshop on
Building Physically Plausible World Models
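A minimal sketch of such a convolutional encoder-decoder with illustrative
layer sizes; the paper's actual architecture, input encoding, and training
details are not specified in the abstract. The network maps a workspace
obstacle image to a per-configuration collision probability map over a
discretised C-space of the same resolution.

    import torch
    import torch.nn as nn

    class CSpaceNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.enc = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.dec = nn.Sequential(
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

        def forward(self, workspace):             # (B, 1, H, W) obstacle map
            return torch.sigmoid(self.dec(self.enc(workspace)))

    net = CSpaceNet()
    print(net(torch.rand(1, 1, 64, 64)).shape)    # torch.Size([1, 1, 64, 64])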
♻ ☆ Human-Machine Shared Control Approach for the Takeover of CACC
Cooperative Adaptive Cruise Control (CACC) often requires human takeover for
tasks such as exiting a freeway. Direct human takeover can pose significant
risks, especially given the close-following strategy employed by CACC, which
might cause drivers to feel unsafe and execute hard braking, potentially
leading to collisions. This research aims to develop a CACC takeover controller
that ensures a smooth transition from automated to human control. The proposed
CACC takeover maneuver employs an indirect human-machine shared control
approach, modeled as a Stackelberg competition where the machine acts as the
leader and the human as the follower. The machine guides the human to respond
in a manner that aligns with the machine's expectations, aiding in maintaining
following stability. Additionally, the human reaction function is integrated
into the machine's predictive control system, moving beyond a simple
"prediction-planning" pipeline to enhance planning optimality. The controller
has been verified to i) enable a smooth takeover maneuver of CACC; ii) ensure
string stability provided that the platoon has fewer than 6 CAVs and
human control authority is below 40%; iii) enhance both perceived and
actual safety through machine interventions; and iv) reduce the impact on
upstream traffic by up to 60%.
comment: This article has been published on IEEE Transactions on Intelligent
Transportation Systems (2025)
♻ ☆ ImLPR: Image-based LiDAR Place Recognition using Vision Foundation Models
LiDAR Place Recognition (LPR) is a key component in robotic localization,
enabling robots to align current scans with prior maps of their environment.
While Visual Place Recognition (VPR) has embraced Vision Foundation Models
(VFMs) to enhance descriptor robustness, LPR has relied on task-specific models
with limited use of pre-trained foundation-level knowledge. This is due to the
lack of 3D foundation models and the challenges of using VFM with LiDAR point
clouds. To tackle this, we introduce ImLPR, a novel pipeline that employs a
pre-trained DINOv2 VFM to generate rich descriptors for LPR. To the best of our
knowledge, ImLPR is the first method to utilize a VFM for LPR while retaining
the majority of pre-trained knowledge. ImLPR converts raw point clouds into
novel three-channel Range Image Views (RIV) to leverage VFM in the LiDAR
domain. It employs MultiConv adapters and Patch-InfoNCE loss for effective
feature learning. We validate ImLPR on public datasets and outperform
state-of-the-art (SOTA) methods across multiple evaluation metrics in both
intra- and inter-session LPR. Comprehensive ablations on key design choices
such as channel composition, RIV, adapters, and the patch-level loss quantify
each component's impact. We release ImLPR as open source for the robotics
community: https://github.com/minwoo0611/ImLPR.
comment: CoRL2025 Accepted, 23 Pages, 15 Figures and 14 Tables