2021-03-18-chowdhury21b.md

File metadata and controls

58 lines (58 loc) · 2.63 KB
---
title: Reinforcement Learning in Parametric MDPs with Exponential Families
abstract: Extending model-based regret minimization strategies for Markov decision
  processes (MDPs) beyond discrete state-action spaces requires structural assumptions
  on the reward and transition models. Existing parametric approaches establish regret
  guarantees by making strong assumptions about either the state transition distribution
  or the value function as a function of state-action features, and often do not satisfactorily
  capture classical problems like linear dynamical systems or factored MDPs. This
  paper introduces a new MDP transition model defined by a collection of linearly
  parameterized exponential families with $d$ unknown parameters. For finite-horizon
  episodic RL with horizon $H$ in this MDP model, we propose a model-based upper confidence
  RL algorithm (Exp-UCRL) that solves a penalized maximum likelihood estimation problem
  to learn the $d$-dimensional representation of the transition distribution, balancing
  the exploitation-exploration tradeoff using confidence sets in the exponential family
  space. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case)
  regret bound that is of order $\tilde O(d\sqrt{H^3 N})$, sub-linear in total time
  $N$, linear in dimension $d$, and polynomial in the planning horizon $H$. This is
  achieved by deriving a novel concentration inequality for conditional exponential
  families that might be of independent interest. The exponential family MDP model
  also admits an efficient posterior sampling-style algorithm for which a similar
  guarantee on the Bayesian regret is shown.
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: chowdhury21b
month: 0
tex_title: Reinforcement Learning in Parametric MDPs with Exponential Families
firstpage: 1855
lastpage: 1863
page: 1855-1863
order: 1855
cycles: false
bibtex_author: Chowdhury, Sayak Ray and Gopalan, Aditya and Maillard, Odalric-Ambrym
author:
- given: Sayak Ray
  family: Chowdhury
- given: Aditya
  family: Gopalan
- given: Odalric-Ambrym
  family: Maillard
date: 2021-03-18
address:
container-title: Proceedings of The 24th International Conference on Artificial Intelligence
  and Statistics
volume: '130'
genre: inproceedings
issued:
  date-parts:
  - 2021
  - 3
  - 18