This blog post explains one of my favorite open problems in Reinforcement Learning theory.
The purpose of this post is to introduce the relevant background, publicize some work in this area, and provide some (incomplete) opinions and thoughts on the open problems. My hope is that this blog post provides a starting point for future research.
I’ve tried to model this after similar blog posts from the Learning Theory Alliance. Another source of inspiration is the COLT Open Problem track (of course, this will be much less formal and not up to the same level of scrutiny).
Disclaimer. There is a lot of literature on RL, and admittedly I know only very little of it. If there is anything here which is inaccurate or incomplete, please email me!
Some Acknowledgements. The material is based on various discussions I’ve had over the years with Dylan Foster, Zeyu Jia, Ayush Sekhari, Akshay Krishnamurthy and Philip Amortila. Thanks in particular to Akshay who gave me comments on the first version of this post. I’d encourage the reader to check out their works (as well as the other papers, books, tutorials mentioned in this post)!
Introduction
Over the past decade, we’ve seen tremendous success in empirical RL applied to environments with large state spaces. At the heart of these advances is the use of function approximation (via deep neural networks) to enable generalization across states and actions.
Theoretically, our understanding is woefully behind. Perhaps it is too much to ask for a complete theoretical understanding of deep RL, but I'd go further and claim that RL theory is at odds with RL practice.
Why?
The short answer is that a lot of RL theory is built on a particular "completeness" assumption. ("Completeness" is in quotes because I haven't given a formal definition yet.) For now, one can think of "completeness" as an extremely strong representational assumption on the function approximator class that the learner is given.
The high-level motivating question here is: is this kind of completeness assumption actually necessary for sample-efficient RL with function approximation, or do weaker assumptions like realizability suffice?
There are at least two formalizations of this question, which I will present here.
Preliminaries. I'll assume that the reader is familiar with the basics of statistical learning theory and reinforcement learning. I'll denote MDPs by the tuple $M = (\mathcal{X}, \mathcal{A}, H, P, r, d_1)$. We will assume that the action set has cardinality $A$, and that the per-trajectory rewards are bounded in $[0,1]$. For a given policy $\pi$ we denote the occupancy measure at layer $h$ as $d^\pi_h$. For value-function classes $\mathcal{Q}$ (resp. policy classes $\Pi$) we will use $\mathcal{Q}_h := \{ Q_h: Q \in \mathcal{Q} \}$ (resp. $\Pi_h := \{\pi_h: \pi \in \Pi \}$). We use $\mathcal{T}_h$ to denote the Bellman operator at layer $h \in [H]$, i.e., for any $f: \mathcal{X} \times \mathcal{A} \to [0,1]$ we have \([\mathcal{T}_h f](x,a) = r(x,a) + \mathbb{E}_{x' \sim P(x,a)} \max_{a'} f(x', a')\). We'll focus on the PAC RL problem: given a policy class $\Pi$, find an $\varepsilon$-optimal policy $\hat{\pi}$, i.e., one such that with probability at least $0.99$, \(J( \hat{\pi}) \ge \max_{\pi \in \Pi} J(\pi) - \varepsilon\), where $J(\pi)$ denotes the expected total reward of $\pi$ in $M$.
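To fix intuition (and notation), here is a minimal tabular sketch of the Bellman backup operator. The array shapes and the finite state/action spaces are hypothetical placeholders, not part of any formal setup in this post.

```python
import numpy as np

def bellman_backup(f, r_h, P_h):
    """Apply the Bellman operator T_h to a candidate function f at layer h+1.

    f   : array of shape (X, A), a candidate Q-function at layer h+1
    r_h : array of shape (X, A), rewards at layer h
    P_h : array of shape (X, A, X), transition kernel at layer h
    Returns [T_h f](x, a) = r_h(x, a) + E_{x' ~ P_h(x, a)} max_a' f(x', a').
    """
    v_next = f.max(axis=1)        # V(x') = max_a' f(x', a'), shape (X,)
    return r_h + P_h @ v_next     # expectation over x', shape (X, A)
```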
Background: Challenges in RL
Modern RL in large state spaces has three fundamental challenges: generalization, exploration, and error amplification. First, I’ll go over what I mean by these three challenges, and introduce the sort of structural assumptions that are commonly used to address them in theory. (Here is a nice tutorial from FOCS 2020 that explains this perspective in more detail.)
Challenge 1: Generalization
In order to handle generalization across unseen instances, the typical approach in statistical learning assumes that the learner is given access to some hypothesis class $\mathcal{H}$. The class $\mathcal{H}$ encodes prior belief on what a good predictor might look like. Sample complexity guarantees for learning algorithms scale as $\log |\mathcal{H}|$ (and for infinite hypothesis classes, this can be replaced by some quantity like VC dimension, covering numbers, etc.)
In order to use function approximation in RL, the first question we should ask is: “what are we trying to approximate?” Since there is additional complexity in the RL problem due to the underlying MDP, there are three ways of using function approximation.
- Model-Based. The learner is given access to some MDP class $\mathcal{M}$ which models the unknown transitions $P: \mathcal{X} \times \mathcal{A} \to \Delta(\mathcal{X})$ and rewards $r: \mathcal{X} \times \mathcal{A} \to [0,1]$.
- Value-Based. The learner is given access to a value-function class $\mathcal{Q}$ which models the optimal Q-function $Q^\star: \mathcal{X} \times \mathcal{A} \to [0,1]$.
- Policy-Based. The learner is given access to a policy class $\Pi$ which models the optimal policy $\pi^\star: \mathcal{X} \to \mathcal{A}$.
Remark. Value-based and policy-based function approximation are often both considered “model-free” colloquially, but I’ll stick with value/policy to disambiguate them.
One can think of this as a hierarchy of sorts: if the learner is given an MDP class $\mathcal{M}$, then this implies a value-function class $\mathcal{Q}$ of size $| \mathcal{Q} | \le | \mathcal{M} |$, since we can build a value-function class by taking the optimal Q-function of each MDP in our class. Similarly, if the learner is given a value-function class $\mathcal{Q}$, this implies a policy class $\Pi$ of size $| \Pi | \le | \mathcal{Q} |$ (by taking greedy policies). Policy-based learning is the "minimal" assumption: without any policy class, the learner is in the tabular setting, where they must compete with all possible policies, which is known to require a number of samples scaling with the state space size $|\mathcal{X}|$. Thus, these three ways of using function approximation are arranged from strongest to weakest.
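To see the collapse from value functions to policies concretely, here is a tiny sketch with a made-up random value-function class (everything here is hypothetical and purely illustrative): mapping each Q-function to its greedy policy and deduplicating can only shrink the class.

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, A = 2, 2, 2                                       # tiny hypothetical problem

def greedy_policy(Q):
    """Deterministic greedy policy induced by a Q-function of shape (H, X, A)."""
    return tuple(map(tuple, Q.argmax(axis=2)))          # hashable action table

Q_class = [rng.random((H, X, A)) for _ in range(50)]    # a made-up value-function class
Pi_class = {greedy_policy(Q) for Q in Q_class}          # induced policy class

print(len(Q_class), len(Pi_class))                      # |Pi_Q| <= |Q|: 50 vs. at most 2^(H*X) = 16
```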
For the rest of this discussion, we will be fine with getting sample complexity guarantees that scale logarithmically with the cardinality of the function approximator class (as well as polynomially with other problem parameters such as horizon $H$, accuracy $\varepsilon^{-1}$, and number of actions $A$). Roughly speaking, log cardinality encodes the complexity of “supervised learning” in the problem. Extending these guarantees to infinite classes can usually be accomplished via fairly standard uniform convergence arguments.
In typical learning theory fashion, we will also focus on abstract classes (with minimal assumptions on the precise structure of the class). There has been a lot of work on RL with function approximator classes that have some kind of "linear" structure: see, e.g., Bellman rank [JKALS16], witness rank [SJKAL18], linear MDPs [JYWY19], and bilinear classes [DKLLMSW21]. Algorithms here often achieve strong guarantees by exploiting said linear structure, but it is unclear if they can be generalized in a satisfactory way.
Challenge 2: Exploration
Exploration is a critical component of RL algorithms, since learning agents must try out different actions in order to strategically explore the state space.
We’ll focus on structural conditions for exploration which are based on the idea of coverage. These conditions generally take the form of density ratios between two distributions. We state a classical notion of such a coverage condition.
Definition (Concentrability). Let $\mu = \{ \mu_h \}_{h \in [H]}$ be a collection of distributions over states. The concentrability coefficient for $\mu$ with respect to a policy class $\Pi$ and MDP $M$ is defined as
\[C_\mathsf{conc}(\mu; \Pi, M) := \max_{\pi \in \Pi, h\in [H]} \Big\lVert \frac{d^\pi_h}{\mu_h} \Big\rVert_\infty.\]

One can think of $\mu$ as some exploratory distribution that the learner has sampling access to. Concentrability plays a fundamental role in offline RL, where it measures the quality of the offline distribution. As one can see from the definition, the quantity $C_\mathsf{conc}$ will be small if $\mu$ allocates mass to all regions of the state space which are represented in the occupancy measures of $\Pi$.
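In the tabular case, the coefficient is just the largest pointwise ratio between any occupancy measure in the class and $\mu$. A minimal sketch, with hypothetical occupancy arrays:

```python
import numpy as np

def concentrability(occupancies, mu):
    """Concentrability coefficient C_conc(mu; Pi, M) in a tabular setting.

    occupancies : array of shape (num_policies, H, X), d^pi_h for each pi in Pi
    mu          : array of shape (H, X), the exploratory distributions mu_h
    Assumes mu_h(x) > 0 wherever some d^pi_h(x) > 0 (otherwise the ratio is infinite).
    """
    ratios = occupancies / mu[None, :, :]   # pointwise density ratios d^pi_h(x) / mu_h(x)
    return ratios.max()                     # max over policies, layers, and states
```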
A highly influential paper [XFBJK22] establishes that coverage conditions also play a role even in online RL. To do so, they define a quantity called coverability, which merely posits the existence of a good offline distribution.
Definition (Coverability). The coverability coefficient for a policy class $\Pi$ and MDP $M$ is defined as
\[C_\mathsf{cov}(\Pi, M) := \inf_{\mu}\max_{\pi \in \Pi, h\in [H]} \Big\lVert \frac{d^\pi_h}{\mu_h} \Big\rVert_\infty.\]

In words, $C_\mathsf{cov}$ is the best possible concentrability coefficient for a given $\Pi$ and $M$. An important feature of this definition is that even if a learner is told the problem instance has low coverability, they do not know the identity of the minimizer $\mu$, nor do they have explicit sampling access to it. Many settings where statistically tractable online RL is known to be possible have been shown to have low coverability; see the paper for examples.
Personally, I find this definition quite illuminating, since it cleanly captures the intuition of how difficult it is to uniformly explore every policy in a given class. Note that neither concentrability nor coverability requires any additional function approximation in the form of a model class or value-function class, which makes them "minimal" definitions.
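In the tabular case one can even write down the minimizing $\mu$: a standard argument (choose $\mu_h$ proportional to the pointwise maximum $\max_{\pi} d^\pi_h$) shows that $C_\mathsf{cov} = \max_h \lVert \max_\pi d^\pi_h \rVert_1$. A minimal sketch, again with hypothetical occupancy arrays:

```python
import numpy as np

def coverability(occupancies):
    """Coverability coefficient C_cov(Pi, M) for tabular occupancy measures.

    occupancies : array of shape (num_policies, H, X), d^pi_h for each pi in Pi.
    For each layer h, the best mu_h is proportional to max_pi d^pi_h, so
    C_cov = max_h || max_pi d^pi_h ||_1.
    """
    pointwise_max = occupancies.max(axis=0)     # max over policies, shape (H, X)
    return pointwise_max.sum(axis=1).max()      # max over layers of the L1 norm
```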
For the interested reader, our paper [JLRSS23] introduces a worst-case (over MDPs) variant of coverability called the spanning capacity, and studies its role in characterizing the minimax sample complexity of policy-based RL.
Challenge 3: Error Amplification
The last (and perhaps greatest) challenge in RL is how to tame error amplification due to the long horizon. The predominant way of addressing this is via representational conditions on the function approximator class.
In supervised learning, we typically study the agnostic setting. That is, we do not assume anything on how well the hypothesis class $\mathcal{H}$ represents nature itself (as given by the underlying distribution over labeled examples). Nonetheless, it is still possible to get learning guarantees, as shown by the seminal work of Vapnik and Chervonenkis.
In RL, the analogue for this is agnostic policy learning, where the learner is only given an arbitrary policy class $\Pi$ (the weakest function approximator), and asked to return a policy that competes with the best policy in $\Pi$. We also do not assume that $\Pi$ represents nature (i.e., the MDP) itself.
Unfortunately, the guarantees here can be quite pessimistic (often exponential in the horizon for many policy classes of interest) [JLRSS23]. The reason for such a blowup is the long horizon, which can make even relatively small policy classes hard to learn. The canonical example is the class of "open-loop" policies $\Pi = \{a_0, a_1 \}^H$, which has statistical complexity $\log | \Pi | = O(H)$ but is hard to learn in the worst case: consider the full "binary tree" MDP of depth $H$ with a reward hidden at one of the leaves.
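Here is a toy simulation of that hard instance (my own construction, purely for illustration): a hidden rewarding leaf in a depth-$H$ binary tree, and a learner that tries open-loop policies one by one. On average such a learner needs on the order of $2^{H-1}$ episodes to find the reward.

```python
import itertools, random

H = 12
target = tuple(random.choices([0, 1], k=H))   # the hidden rewarding leaf (an action string)

def run_episode(actions):
    """Reward 1 iff the open-loop action sequence reaches the hidden leaf."""
    return 1.0 if tuple(actions) == target else 0.0

episodes = 0
for policy in itertools.product([0, 1], repeat=H):   # enumerate all 2^H open-loop policies
    episodes += 1
    if run_episode(policy) == 1.0:
        break
print(f"found the rewarding leaf after {episodes} of {2**H} episodes")
```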
In some sense, this is to be expected, since we didn’t really make any assumptions at all. To improve upon these results, we can assume some form of realizability: that the “ground truth” lies in our function approximator class. There are three variants:
- model-based realizability (that $M \in \mathcal{M}$),
- value-based realizability (that $Q^\star \in \mathcal{Q}$),
- policy-based realizability (that $\pi^\star \in \Pi$).
The assumption of realizability is usually the starting point of many papers in RL. While stronger than the agnostic setting, realizability is perhaps still a reasonable assumption: in modern practice, it is common to use huge neural networks which conceivably have small misspecification error and are thus approximately realizable.
For model-based RL, there is a well-established theory based on the Decision-Estimation Coefficient [FKQR23, FGH23], which gives sample-efficient algorithms under model-based realizability, as well as corresponding lower bounds.
What about for value-based or policy-based RL? Here, the situation is much less clear. Existing algorithms usually require a much stronger “completeness” condition in order to tame the error amplification issue. In what follows, we will formally introduce two variants of completeness:
- Bellman completeness, for value-based RL,
- policy completeness, for policy-based RL.
We will argue that these completeness assumptions are unnatural, and state the open problems which ask if these assumptions are necessary for sample-efficient RL.
Open Problem 1: The Necessity of Bellman Completeness
In order to tame error amplification in value-based RL, many algorithms rely on the notorious Bellman completeness assumption.
Definition. A value function class $\mathcal{Q}$ satisfies Bellman completeness if it is closed under Bellman backups, i.e., for every $h \in [H]$, $q \in \mathcal{Q}_{h+1}$, $\mathcal{T}_h q \in \mathcal{Q}_h$.
A few remarks are in order.
- The commonly stated definition says that the value-function class satisfies Bellman completeness, but this is slightly inaccurate: Bellman completeness is also a property of the underlying MDP itself (which enters through the expectation in the Bellman backup operator), not just the value-function class.
- One can show that Bellman completeness implies value-based realizability, but it can be much stronger: it is easy to construct settings which satisfy realizability but not Bellman completeness.
- Roughly speaking, model-based realizability implies the existence of a Bellman complete value-function class of similar statistical complexity, but the converse is not true. See the discussion in [CJ19].
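Before poking at the assumption, here is a minimal tabular sketch of what the definition asks for: closure of a finite class under Bellman backups, checked by brute force. The array conventions are hypothetical, and (matching the usual convention) the function at the layer past the horizon is taken to be identically zero.

```python
import numpy as np

def is_bellman_complete(Q_class, r, P, tol=1e-9):
    """Check closure under Bellman backups for a finite tabular class.

    Q_class : list of arrays of shape (H, X, A); Q[h] plays the role of Q_h
    r, P    : rewards of shape (H, X, A) and transitions of shape (H, X, A, X)
    """
    H, X, _ = r.shape
    for h in range(H):
        for q in Q_class:
            # value of the layer-(h+1) function; zero past the horizon
            v_next = q[h + 1].max(axis=1) if h + 1 < H else np.zeros(X)
            backup = r[h] + P[h] @ v_next                      # [T_h q_{h+1}](x, a)
            if not any(np.allclose(backup, q2[h], atol=tol) for q2 in Q_class):
                return False                                   # T_h q_{h+1} is not in Q_h
    return True
```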
Bellman completeness is a strange assumption.
First of all, it is unclear whether it should hold (even approximately) in practice, or even in theory beyond simple linear settings! See [WWSK21] for some empirical evidence of this.
Furthermore, it is an example of a nonmonotonic assumption: if we start with a function class $\mathcal{Q}$ which satisfies Bellman completeness, and add a single function to it, it is unclear if the new $\mathcal{Q}'$ still satisfies Bellman completeness. Contrast this with realizability, which is trivially a monotonic assumption. This is seriously at odds with RL practice, where we would not expect our algorithms to break if we added a single function to the function approximator class (never mind massively overparameterizing everything).
Despite these flaws, Bellman completeness is widely considered a standard assumption. Currently, many algorithms for value-based RL minimize some kind of surrogate objective based on the Bellman error. Due to the double sampling problem, it is challenging to get accurate estimates of the Bellman error from online rollouts. Thus, Bellman completeness is introduced to address this issue.
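To see the double-sampling issue concretely, here is a toy numerical check (my own construction): the naive squared TD error estimated from single transitions equals the squared Bellman error plus the conditional variance of the target, so with stochastic transitions it is a biased surrogate for the quantity we actually care about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a single (x, a) with reward 0 and a stochastic next state whose value
# max_a' Q(x', a') is +1 or -1 with equal probability. The Bellman backup is 0,
# so the candidate Q(x, a) = 0 has zero Bellman error.
q_xa, reward = 0.0, 0.0
n = 100_000
v_next = rng.choice([-1.0, 1.0], size=n)                 # one sampled next state per rollout

td_targets = reward + v_next
naive = np.mean((q_xa - td_targets) ** 2)                # squared TD error from single samples
true_sq_bellman_error = (q_xa - (reward + v_next.mean())) ** 2

print(naive)                   # ~1.0 = (Bellman error)^2 + variance of the target
print(true_sq_bellman_error)   # ~0.0
```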
We do not know whether Bellman completeness is information-theoretically necessary; that is, is there an algorithm that, under coverability + value-function realizability alone, achieves sample complexity
$$\mathrm{poly} ( C_\mathsf{cov}, \log | \mathcal{Q} |, A, H, \varepsilon^{-1} )? $$
This problem is formally stated in [MFR24], where the authors conjecture that the answer is "no". On the flip side, if the answer is "yes", it seems that any algorithm attaining this guarantee must employ new ideas. In the rest of this section, I'll provide an overview of some related works which address this problem under modified assumptions, as well as some speculation about how one might approach it.
Background Results
Let's take a look at some of the work which has addressed variants or modifications of Problem 1. To my knowledge, there is no work which directly addresses Problem 1 itself.
- Offline RL. [CJ19] asked the analogue of Problem 1 for offline value-based RL, where existing analyses required Bellman completeness or stronger coverage conditions [XJ21, ZHHJL22]. Offline RL is an easier setting to study since there is no interactivity, and a negative answer is provided in [FKSLX21] and [JRSW24].
- Online RL. The paper [XFBJK22], which introduced coverability, shows that the GOLF algorithm [JLM21] achieves sample-efficient learning under coverability + Bellman completeness. A followup work [AFJSX24] shows a positive result under coverability + value-function realizability + density-ratio realizability (having additional function approximation for the class of density ratios $d^{\pi}/d^{\pi'}$ for pairs of policies $\pi, \pi' \in \Pi$).
- Local Resets. When the learner is allowed additional access to the MDP in the form of local resets (the ability to perform rollouts from states already observed in the dataset), [MFR24] show that a variant of GOLF achieves sample-efficient learning. In a nutshell, local resets allow the learner to circumvent the double-sampling issue by collecting more data, instead of by assuming Bellman completeness.
What would a lower bound look like?
Let’s suppose we believe that a minimax lower bound holds of the following flavor: it is possible to construct a class of MDPs $\mathcal{M}$ such that:
\[\inf_{\mathsf{Alg}} \sup_{M \in \mathcal{M}} \mathbb{E}^{M, \mathsf{Alg}}[ \text{online samples to solve PAC RL} ] \ge \cdots. \tag{1}\]

I will state two observations that might be a good starting point.
Observation 1 (Function class sizes). I claim that the associated function classes must (roughly) have the following sizes:
\[2^{2^H} \approx | \mathcal{M} | \gg | \mathcal{Q}_\mathcal{M} | = |\Pi_\mathcal{M} | \approx 2^H. \tag{2}\]

Let's break this down. We know that model-based realizability suffices for learning; in particular, it is known that one can get sample complexity which is $\mathrm{poly}(C_{\mathsf{cov}}, \log |\mathcal{M}| )$. In order for the proposed lower bound (1) to hold without contradicting this, it must be the case that $|\mathcal{M}|$ is exponentially larger than $|\mathcal{Q}_{\mathcal{M}}|$, so that many MDPs in the class $\mathcal{M}$ map to the same candidate Q-function in $\mathcal{Q}_{\mathcal{M}}$. As for why $| \mathcal{Q}_{\mathcal{M}} | = | \Pi_{\mathcal{M}} | \approx 2^H$: we need the induced policy class to still be exponentially large in order to rule out the learner simply trying out each policy one by one to find the optimal policy.
Satisfying (2) seems quite challenging, and I am not aware of any constructions in RL which achieve something like this.
The most similar kind of construction that I know of is the "rich observation combination lock," developed in a series of works [KAL16, DMMSS21, KLS25]. The rich observation combination lock provides lower bounds with the following function class sizes (compare to (2)):
\[2^{2^H} \approx |\mathcal{M}| = |\mathcal{Q}_\mathcal{M}| \gg |\Pi_\mathcal{M}| \approx 2^H. \tag{3}\]

This kind of construction is used to prove lower bounds for policy-based learning (without contradicting model-based or value-based guarantees). Briefly, it involves constructing block MDPs without a decoder class, so that there are a huge number of $Q$ functions which all collapse onto a much smaller set of policies.
However, this idea doesn’t work here because we want to prove separations between model-based and value-based RL, so we need to ensure that $|\mathcal{M}|$ is much larger than $|\mathcal{Q}_\mathcal{M}|$.
Observation 2 (Online vs. simulator). [MFR24] observed that if the learner is allowed local simulator access, then the guarantee in Problem 1 is attainable. Hence, our conjectured lower bound must exhibit a separation between online access and simulator access. Conceptually, simulator access makes RL much easier, but surprisingly, few lower bounds of this kind exist.
- One starting point may be [JLRSS23], but the construction there is similar to (3). Furthermore, it is somewhat involved and tailored to the policy learning setting; I have not been able to extend or generalize it to this setting.
- Another construction which exhibits this kind of separation is [WWK21], but this seems to be specific to the linear $Q^\star$ setting.
Taking a step back, I feel like our current understanding of the differences between online vs. simulator access in RL is lacking. Perhaps resolving Problem 1 would clarify our understanding. See this blog post by Nan Jiang for a complementary perspective on the role of simulators for RL research.
Open Problem 2: The Necessity of Policy Completeness
What if the learner only has a policy class? In joint work with Akshay Krishnamurthy and Ayush Sekhari, we explore this question in our paper.
Many results in policy-based RL, either for abstract policy classes or explicit parameterizations, focus on what is called the $\mu$-reset setting. I’ll introduce it now.
Definition. In the $\mu$-reset setting, the learner is given access to exploratory distributions $\mu = \{\mu_h\}_{h \in [H]}$, from which they are allowed to collect online rollouts.
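Concretely, a single interaction in this protocol looks roughly like the following sketch (the sampler and simulator interfaces here are hypothetical, just to pin down the access model): pick a layer $h$, draw a reset state $x_h \sim \mu_h$, and roll out a policy from there to the end of the episode.

```python
def mu_reset_rollout(h, mu, policy, mdp, rng):
    """One rollout in the mu-reset protocol (hypothetical interfaces).

    h      : layer at which to reset
    mu     : list of samplers; mu[h](rng) draws a state x_h ~ mu_h
    policy : policy[hp](x) returns an action at layer hp
    mdp    : mdp.step(hp, x, a) returns (reward, next_state); mdp.horizon is H
    """
    x = mu[h](rng)                        # reset: x_h ~ mu_h
    trajectory, total_reward = [], 0.0
    for hp in range(h, mdp.horizon):      # roll out from layer h to the end of the episode
        a = policy[hp](x)
        reward, x_next = mdp.step(hp, x, a)
        trajectory.append((hp, x, a, reward))
        total_reward += reward
        x = x_next
    return trajectory, total_reward
```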
Why not ask whether it is possible to get $\mathrm{poly} ( C_\mathsf{cov}, \log | \Pi |, A, H, \varepsilon^{-1} )$ sample complexity in the standard online RL setting, without assuming explicit access to these exploratory distributions? A lower bound by [DMMSS21] shows this is impossible (see also the discussion in [JLRSS23]). Furthermore, even with simulator access, we show that it is information-theoretically impossible to adapt to coverability in policy-based RL (a striking contrast with the positive results of [MFR24] for value-based RL). Instead, one can only hope to get bounds which depend on the spanning capacity (which can be much larger than the coverability parameter $C_\mathsf{cov}$). Morally speaking, this means that "policy-based methods cannot explore".
So it makes sense to focus on the $\mu$-reset setting. Our goal is to get bounds which depend polynomially on $C_\mathsf{conc}$ (the concentrability coefficient of $\mu$).
What is known classically for abstract policy classes? There are two results of a similar flavor: Policy Search by Dynamic Programming (PSDP) and Conservative Policy Iteration (CPI). Both of them require a completeness condition:
Definition. We say a policy class $\Pi$ satisfies policy completeness if for every $\pi \in \Pi$ and $h \in [H]$, the greedy improvement $x_h \mapsto \arg\max_{a} Q^{\pi}_h(x_h,a)$ belongs to $\Pi_h$.
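In the tabular case, policy completeness can be checked directly: evaluate $Q^\pi$ for each $\pi \in \Pi$ by backward induction, take the greedy policy at each layer, and test membership in $\Pi_h$. A minimal sketch, assuming a hypothetical finite class of deterministic policies:

```python
import numpy as np

def q_pi(pi, r, P):
    """Q^pi for a deterministic tabular policy pi of shape (H, X).

    r : rewards of shape (H, X, A); P : transitions of shape (H, X, A, X).
    """
    H, X, A = r.shape
    Q = np.zeros((H, X, A))
    for h in reversed(range(H)):
        v_next = np.zeros(X)
        if h + 1 < H:
            v_next = Q[h + 1][np.arange(X), pi[h + 1]]   # V^pi_{h+1}(x) = Q^pi_{h+1}(x, pi_{h+1}(x))
        Q[h] = r[h] + P[h] @ v_next
    return Q

def is_policy_complete(Pi, r, P):
    """Check policy completeness for a finite class Pi of arrays of shape (H, X)."""
    H = r.shape[0]
    layer_sets = [{tuple(pi[h]) for pi in Pi} for h in range(H)]   # the layer classes Pi_h
    for pi in Pi:
        greedy = q_pi(pi, r, P).argmax(axis=2)                     # greedy improvement, shape (H, X)
        if any(tuple(greedy[h]) not in layer_sets[h] for h in range(H)):
            return False
    return True
```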
Policy completeness plays a similar role to Bellman completeness in taming error amplification. Essentially, since PSDP and CPI are built on the idea of “policy improvement”, this condition allows one to iteratively update an estimated policy without incurring a blowup in the complexity. Similar to the value-based setting, policy completeness implies policy realizability, but can be a much more stringent requirement.
In our work, we asked whether it is possible to remove the policy completeness assumption entirely and achieve sample-efficient learning with arbitrary policy classes (the agnostic setting). We proved an information-theoretic lower bound which rules this out.
What about realizability, which lies between policy completeness and the agnostic setting?
Unfortunately, our lower bound for the agnostic setting cannot be readily extended to the realizable setting, as the construction crucially relies on the non-realizability of the specific policy class. Interestingly, we also show that PSDP/CPI cannot work, via an algorithm-dependent lower bound. However, that construction can easily be broken by a trivial algorithm that learns the optimal action at layer 1, so it does not settle the question information-theoretically.
An information-theoretic resolution is still open: in the $\mu$-reset setting, under policy realizability alone, is there an algorithm with sample complexity
$$\mathrm{poly} ( C_\mathsf{conc}, \log |\Pi|, A, H, \varepsilon^{-1} )? $$
This is a policy-based analogue of Problem 1, and we suspect that the answer is no!
The situation here is quite similar to the one in value-based RL, where it is also unknown whether realizability alone is sufficient. As another point of comparison, our work also provides evidence that local resets can help (cf. [MFR24] for value-based RL). However, I do not know if there are any formal reductions/relationships between these two problems, since they operate in slightly different settings ($\mu$-reset vs. online). For example, it might be interesting to see if one can use a realizable value-function class $\mathcal{Q}$ to construct a sampler for the $\mu$-reset distribution of the induced policy class $\Pi_\mathcal{Q}$, using only online access.
Conclusion
To summarize, I've explained two open problems in RL theory. I would even go as far as to say that these are the most fundamental problems in RL theory right now, since they address a glaring deficiency in our understanding. Answering them may give us new algorithmic insights and clarify the power and limitations of different interaction protocols.
They are also just fun math problems to think about, and I hope people work on them!