arXiv:2602.23816

Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective

Published on Feb 27

Abstract

AI-generated summary: A safe Q-inverse constrained reinforcement learning algorithm is developed to learn policies that balance reward maximization with safety constraints, using demonstrated trajectories and Q-value assessments of state-action pairs.

Given a set of trajectories demonstrating safe execution of a task in a constrained MDP with observable rewards but unknown constraints and unobservable costs, we aim to find a policy that maximizes the likelihood of the demonstrated trajectories, balancing conservatism against significantly increasing the likelihood of high-reward trajectories that may contain unsafe steps. With these objectives, we learn a policy that maximizes the probability of the most promising trajectories with respect to the demonstrations. To do so, we formulate the "promise" of individual state-action pairs in terms of Q-values, which depend on task-specific rewards as well as on an assessment of states' safety, mixing expectations over rewards and safety. This yields a safe Q-learning perspective on the inverse learning problem under constraints. The resulting Safe Q Inverse Constrained Reinforcement Learning (SafeQIL) algorithm is compared to state-of-the-art inverse constrained reinforcement learning algorithms on a set of challenging benchmark tasks, demonstrating its merits.
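The abstract describes scoring state-action pairs by Q-values that mix task rewards with a safety assessment, and deriving a policy from those mixed values. The paper's actual SafeQIL algorithm is not given here; the following is only a minimal tabular sketch of the general idea, under assumed names and forms: separate reward and safety Q-tables (`Q_r`, `Q_s`), a mixing weight `lam`, and a softmax policy over the combined values are all illustrative choices, not the authors' method.

```python
import numpy as np

def mixed_q_update(Q_r, Q_s, s, a, r, safety, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update for each value table (illustrative, not SafeQIL itself).

    Q_r tracks expected task reward; Q_s tracks an assumed scalar safety
    signal (e.g. 0 for safe steps, negative for unsafe ones).
    """
    Q_r[s, a] += alpha * (r + gamma * Q_r[s_next].max() - Q_r[s, a])
    Q_s[s, a] += alpha * (safety + gamma * Q_s[s_next].max() - Q_s[s, a])

def softmax_policy(Q_r, Q_s, s, lam=1.0, temp=1.0):
    """Action distribution from mixed reward/safety values at state s.

    lam trades off reward maximization against conservatism: larger lam
    penalizes actions whose safety value is low.
    """
    q = Q_r[s] + lam * Q_s[s]
    z = np.exp((q - q.max()) / temp)  # shift by max for numerical stability
    return z / z.sum()

# Tiny example: two states, two actions; action 1 is rewarding but unsafe.
Q_r = np.zeros((2, 2))
Q_s = np.zeros((2, 2))
mixed_q_update(Q_r, Q_s, s=0, a=0, r=1.0, safety=0.0, s_next=1)
mixed_q_update(Q_r, Q_s, s=0, a=1, r=1.0, safety=-1.0, s_next=1)
p = softmax_policy(Q_r, Q_s, s=0)  # safety-aware policy prefers action 0
```

With `lam=0` the policy collapses to plain reward-driven Q-learning; increasing `lam` shifts probability mass toward actions assessed as safe, which is the conservatism/reward trade-off the abstract refers to.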
