Abstract:
Mirror descent methods form the basis of some of the most efficient Reinforcement Learning (RL) algorithms. We study policy mirror descent for entropy-regularised, infinite-time-horizon, discounted Markov decision problems (MDPs) formulated on general Polish state and action spaces. The continuous-time version of policy mirror descent is a Fisher--Rao gradient flow on the space of conditional probability measures (i.e., policies).
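As a minimal sketch of the setting (the notation below, including a reference measure $\mu$, regularisation strength $\tau>0$, discount factor $\gamma\in(0,1)$, reward $r$ and the regularised state-action value function $Q^{\pi}_{\tau}$, is chosen here for illustration and is not fixed by this abstract), the entropy-regularised value of a policy $\pi$ and the Fisher--Rao flow on policies $t\mapsto\pi_t(\mathrm{d}a|s)$ can be written as
\[
V^{\pi}_{\tau}(s)
= \mathbb{E}^{\pi}\Big[\sum_{n\ge 0}\gamma^{n}\Big(r(s_n,a_n)
- \tau \log\tfrac{\mathrm{d}\pi(\cdot|s_n)}{\mathrm{d}\mu(\cdot|s_n)}(a_n)\Big)\,\Big|\, s_0=s\Big],
\qquad
\partial_t \pi_t(\mathrm{d}a|s) = \pi_t(\mathrm{d}a|s)\, A^{\pi_t}_{\tau}(s,a),
\]
where $A^{\pi}_{\tau}(s,a) = Q^{\pi}_{\tau}(s,a) - \tau \log\tfrac{\mathrm{d}\pi(\cdot|s)}{\mathrm{d}\mu(\cdot|s)}(a) - V^{\pi}_{\tau}(s)$ is the regularised advantage and $Q^{\pi}_{\tau}(s,a) = r(s,a) + \gamma\,\mathbb{E}[V^{\pi}_{\tau}(s_1)\,|\,s_0=s,\,a_0=a]$. Up to sign conventions (reward maximisation here versus cost minimisation), this replicator-type equation is the small-step-size limit of policy mirror descent with a KL proximal term, and it preserves the normalisation of $\pi_t(\cdot|s)$ since the advantage integrates to zero against $\pi_t(\cdot|s)$.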
In this talk, we show that the Fisher--Rao gradient flow converges exponentially (i.e., linearly) to the optimal solution of the MDP. Although the optimisation objective is not convex, the proof exploits the structure of the problem revealed by the performance difference lemma. Mathematically, the main difficulty lies in proving the well-posedness of the Fisher--Rao flow and the differentiability of the value function and the Bregman divergence along the flow. The dual flow plays a key role in overcoming these difficulties.
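For orientation only (the precise statement and standing assumptions are in the paper; $d^{\pi'}_{\rho}$ denotes the normalised discounted state-occupancy measure of $\pi'$ started from an initial distribution $\rho$, and the remaining notation is as in the sketch above), the regularised performance difference lemma takes the schematic form
\[
V^{\pi'}_{\tau}(\rho) - V^{\pi}_{\tau}(\rho)
= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'}_{\rho}}
\Big[\int_A Q^{\pi}_{\tau}(s,a)\,\pi'(\mathrm{d}a|s)
- \tau\,\mathrm{KL}\big(\pi'(\cdot|s)\,\|\,\mu(\cdot|s)\big)
- V^{\pi}_{\tau}(s)\Big].
\]
Differentiating the value function and a KL-type Bregman divergence along the flow and combining the resulting identities with this lemma is, roughly speaking, what substitutes for convexity in the convergence argument.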
We also prove the stability of the flow under perturbations of the gradient arising, for example, from sampling or from approximation of the Q-function. Specifically, we show that the value function converges exponentially fast to the optimal value function, up to a bounded accumulated error associated with the perturbation. This stability result can be used to deduce the convergence of natural policy gradient flows for more practical policy parameterisations. We study one such application, to log-linear policy classes, and obtain exponential convergence up to an additional error arising from compatible function approximation.
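As a purely schematic illustration of the type of stability estimate meant here (the rate $c>0$, the norm and the exact error functional are placeholders, not those of the paper): if the flow is driven by an approximation $\widehat{Q}_t$ of $Q^{\pi_t}_{\tau}$, a Grönwall-type argument yields a bound of the shape
\[
V^{*}_{\tau}(\rho) - V^{\pi_t}_{\tau}(\rho)
\;\lesssim\; e^{-c t}\,\big(V^{*}_{\tau}(\rho) - V^{\pi_0}_{\tau}(\rho)\big)
+ \int_0^t e^{-c(t-u)}\,\big\|\widehat{Q}_u - Q^{\pi_u}_{\tau}\big\|\,\mathrm{d}u,
\]
so the value converges exponentially up to an accumulated perturbation error. For a log-linear class, say $\pi_{\theta}(\mathrm{d}a|s) \propto \exp\big(\theta^{\top}\varphi(s,a)\big)\,\mu(\mathrm{d}a|s)$ with features $\varphi$, the perturbation term is controlled by a compatible function approximation error, i.e. by how well the regularised Q-function (equivalently, the advantage) can be approximated in the span of the features under the relevant state-action distribution.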
This is joint work with James-Michael Leahy, Bekzhan Kerimkulov, Lukasz Szpruch and Yufei Zhang.