Abstract:
We consider off-policy estimation of the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique for deriving (nearly) unbiased estimators, but is known to suffer from excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimator that applies IS directly to the stationary state-visitation distributions to avoid the exploding variance faced by existing methods. Our key contribution is a novel approach to estimating the density ratio of two stationary state distributions, with trajectories sampled from only the behavior distribution. We develop a minimax loss function for the estimation problem, and derive a closed-form solution for the case of RKHS. We support our method with both theoretical and empirical analyses.
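The variance contrast described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the paper's method: it assumes a hypothetical 2-state, 2-action MDP whose transition model is known, so the stationary density ratio `w` is computed exactly rather than estimated from behavior trajectories with the minimax loss. All names and numbers (`P`, `R`, `pi_b`, `pi_t`, `trajectory_weight_variance`) are illustrative assumptions.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s']: transition probabilities
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 2.0]])     # R[s, a]: rewards
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])  # behavior policy pi_b[s, a]
pi_t = np.array([[0.2, 0.8], [0.3, 0.7]])  # target policy pi_t[s, a]
beta = pi_t / pi_b                         # per-step action probability ratio


def stationary_distribution(policy):
    """Stationary state distribution of the Markov chain induced by `policy`."""
    P_pi = np.einsum("sa,sat->st", policy, P)  # state-to-state transition matrix
    n = P_pi.shape[0]
    # Solve d @ P_pi = d together with sum(d) = 1 as one linear system.
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]


d_b = stationary_distribution(pi_b)
d_t = stationary_distribution(pi_t)
w = d_t / d_b  # exact state density ratio (assumed known here, estimated in the paper)

# Average reward of the target policy, computed directly and via state-level IS
# under the behavior policy's stationary distribution.
true_value = float(np.sum(d_t[:, None] * pi_t * R))
is_value = float(np.sum(d_b[:, None] * pi_b * (w[:, None] * beta) * R))


def trajectory_weight_variance(horizon):
    """Exact variance of the product of per-step ratios over `horizon` steps."""
    # M[s, s'] = sum_a pi_b(a|s) * beta(a|s)^2 * P(s'|s, a)
    M = np.einsum("sa,sat->st", pi_b * beta**2, P)
    second_moment = d_b @ np.linalg.matrix_power(M, horizon) @ np.ones(len(d_b))
    return float(second_moment - 1.0)  # the trajectory weight has mean 1


# Variance of the per-sample weight w(s) * beta(a|s); it does not depend on the horizon.
state_weight_variance = float(
    np.sum(d_b[:, None] * pi_b * (w[:, None] * beta) ** 2) - 1.0
)

for T in (1, 10, 50):
    print("horizon", T, "trajectory-weight variance:", trajectory_weight_variance(T))
print("state-level weight variance:", state_weight_variance)
print("target value (direct, state-level IS):", true_value, is_value)
```

In this sketch the trajectory-wise weight is a product of per-step ratios, so its variance grows geometrically with the horizon, while the state-level weight is a single bounded ratio per sample. The two value computations agree because reweighting the behavior policy's stationary distribution by `w` recovers the target policy's stationary distribution.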
Citation:
@inproceedings{liu2018breaking,
title={Breaking the curse of horizon: Infinite-horizon off-policy estimation},
author={Liu, Qiang and Li, Lihong and Tang, Ziyang and Zhou, Dengyong},
booktitle={Advances in Neural Information Processing Systems},
pages={5356--5366},
year={2018}
}