Strategy complexity of lim sup and lim inf payoff objectives in countable MDPs
Date: 30/08/2023
Author: Munday, Eric
Abstract
We study countably infinite Markov decision processes (MDPs) with real-valued transition rewards. A strategy is a function which decides how plays proceed within the MDP. Every strategy induces a set of infinite runs in the MDP, and each infinite run induces the following sequences of payoffs:
1. Point payoff (the sequence of directly seen transition rewards),
2. Mean payoff (the sequence of the sums of all rewards so far, divided by the number
of steps), and
3. Total payoff (the sequence of the sums of all rewards so far).
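As a concrete illustration (a Python sketch of ours, not taken from the thesis), the three sequences induced by a finite prefix of rewards can be computed as follows:

# Illustrative sketch (ours): the three payoff sequences induced by a
# finite prefix of transition rewards r_1, ..., r_n.
from itertools import accumulate

def payoff_sequences(rewards):
    """Return the point, mean and total payoff sequences for a reward prefix."""
    point = list(rewards)                              # r_n
    total = list(accumulate(point))                    # r_1 + ... + r_n
    mean = [s / (i + 1) for i, s in enumerate(total)]  # (r_1 + ... + r_n) / n
    return point, mean, total

point, mean, total = payoff_sequences([1, -2, 1, 1])
print(point)  # [1, -2, 1, 1]
print(mean)   # [1.0, -0.5, 0.0, 0.25]
print(total)  # [1, -1, 0, 1]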
For each payoff type, the threshold objective is to maximise the probability that the lim sup/lim inf is non-negative. We are interested in the strategy complexity of the above objectives, i.e. the amount of memory and/or randomisation that a strategy needs access to in order to play well (optimally resp. ε-optimally). Our results seek not only to decide whether an objective requires finite or infinite memory, but also, in the case of infinite memory, to determine what kind of infinite memory is necessary and sufficient. For example, a step counter, which acts as a clock, or a reward counter, which sums up the rewards seen so far, may be sufficient.
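A minimal sketch (ours; the names are hypothetical and not the thesis's notation) of these two memory structures, assuming a strategy may read the current state together with the counter values:

from dataclasses import dataclass

@dataclass
class CounterMemory:
    steps: int = 0            # step counter: number of transitions taken so far (a clock)
    reward_sum: float = 0.0   # reward counter: sum of the rewards seen so far

    def update(self, reward: float) -> None:
        self.steps += 1
        self.reward_sum += reward

def counter_based_strategy(state, mem: CounterMemory) -> str:
    # Purely illustrative decision rule: the point is only that the choice
    # may depend on the counters rather than on the full history of the run.
    return "safe" if mem.reward_sum >= 0 else "risky"

mem = CounterMemory()
for r in [1.0, -2.0, 0.5]:
    mem.update(r)
print(mem.steps, mem.reward_sum)            # 3 -0.5
print(counter_based_strategy("s0", mem))    # risky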
We compare the lim sup/lim inf point payoff objectives to the Büchi/co-Büchi objectives which, given a set of states or transitions, seek to maximise the probability that this set is visited infinitely/finitely often. Convergence effects are what differentiate the lim sup/lim inf point payoff objectives from Büchi/co-Büchi. For example, the sequence −1/2, −1/3, −1/4, . . . satisfies lim sup ≥ 0 and lim inf ≥ 0 despite all of the rewards being negative. It is in dealing with these effects that we make our main technical contributions. We establish a complete picture of the strategy complexity for both the lim sup and lim inf point payoff objectives. In particular, we show that optimal lim sup strategies require either randomisation or access to a step counter, and that lim inf of point payoff requires a step counter (but not more) when the underlying MDP is infinitely branching.
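A small numerical check of the convergence effect mentioned above (ours, for illustration only):

# Every point payoff in the run -1/2, -1/3, -1/4, ... is strictly negative,
# yet the sequence converges to 0, so lim sup = lim inf = 0 and both
# threshold objectives "lim sup >= 0" and "lim inf >= 0" are satisfied.
rewards = [-1.0 / n for n in range(2, 1_000_000)]

assert all(r < 0 for r in rewards)   # no individual reward is non-negative
# Infima of later and later tails approach the limit 0 from below:
for k in (0, 10, 10_000, 900_000):
    print(k, min(rewards[k:]))       # -0.5, -0.0833..., ~-1.0e-04, ~-1.1e-06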
We also comprehensively pin down the strategy complexity for the lim inf total and mean payoff objectives. This result requires a novel counterexample involving unboundedly growing rewards as well as finely tuned transition probabilities, which force the player to use memory in order to mimic what occurred in past random events. This allows us to show that both of these objectives require both a step counter and a reward counter.
We apply our results to solve two open problems from Sudderth [35] about the strategy complexity of optimal strategies for the expected lim sup/lim inf point payoff. We achieve this by reducing each objective to its respective optimal threshold lim sup/lim inf point payoff counterpart. Thus we are able to conclude that they share the same optimal strategy complexity.