‘Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning’

“Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner. … Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. … In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods.”
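The irreversibility mentioned in the abstract can be illustrated with a minimal sketch. It assumes a Retrace-style clipping rule, c_k = λ · min(1, ρ_k), as one example of a per-decision protocol; this is not the operator proposed in the paper. The function name `per_decision_traces` and the sample ratios are hypothetical.

```python
def per_decision_traces(ratios, lam=1.0):
    """Cumulative trace coefficients under an assumed Retrace-style rule.

    Each step multiplies the running trace by c_k = lam * min(1, rho_k),
    where rho_k is the importance-sampling ratio at step k.
    """
    traces = []
    beta = 1.0
    for rho in ratios:
        beta *= lam * min(1.0, rho)  # per-decision cut of the IS ratio
        traces.append(beta)
    return traces


# Hypothetical IS ratios: the third step has zero probability under the target policy.
ratios = [1.2, 0.5, 0.0, 2.0, 3.0]
print(per_decision_traces(ratios))  # [1.0, 0.5, 0.0, 0.0, 0.0]
```

Because the trace is a running product, the zero at the third step carries through every later step, even though the later ratios are large. That is the sense in which a fully cut trace cannot be recovered by a per-decision rule.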

Read the paper and see the full list of authors on arXiv.
