A Comparison between Sarsa and Expected Sarsa
A theoretical and practical analysis of differences between Sarsa and expected Sarsa
Authors: Jeroen van Wely, Niek IJzerman & Jochem Soons
This article is written as part of the Reinforcement Learning course at the University of Amsterdam, within the Master’s programme Artificial Intelligence.
Introduction
In this article, we compare two variants of a well-known reinforcement learning technique: Sarsa and Expected Sarsa. We will first discuss the theory around these two methods to give a picture of their properties. We will then compare both methods theoretically to form a hypothesis that explains the main trade-off between both methods. Subsequently, we will present multiple empirical experiments and their corresponding results to test our hypothesis. Finally, we will discuss our findings and conclude our comparative study.
Sarsa
We start with the central pillar of this article: Sarsa. Sarsa is an on-policy TD method for control. On-policy means that the policy from which we sample our experiences is the same as the policy we are trying to optimize. Control means that Sarsa can be used to learn the state-action value function Q, from which we can derive our policy. If we take a look at the update step for the state-action values, it becomes clear why the method is called Sarsa:
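$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \, Q(S', A') - Q(S, A) \right]$$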
Here Q(S, A) is the value of taking action A in state S, gamma is the discount factor that controls how much importance we attach to future rewards, and alpha is the step-size parameter (learning rate). We can see that the update takes into account:
- the current state (S)
- the current action (A)
- the reward for going to the next state (R)
- the next state (S’)
- and the next action (A’).
Together, these spell SARSA. How do we select the next state and action? This becomes clear if we take a look at the full algorithm presented in Appendix A: the next action is simply sampled from our current policy, which is derived from Q, while the next state is given by the environment. Of course, we want to do some exploring, so action selection is done stochastically. Usually, Sarsa is implemented with an epsilon-greedy policy over the state-action values, and we follow that implementation in this article. Now that we understand Sarsa, we can move on to the variant we want to compare it with: expected Sarsa.
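Before doing so, here is a small illustration of the epsilon-greedy action selection we just described. This is only a sketch: the function and variable names (`Q`, `n_actions`, `rng`) are ours and are not taken from our published code.

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon, rng):
    """Sample an action epsilon-greedily from the Q-values of `state`."""
    if rng.random() < epsilon:
        # Explore: pick a uniformly random action.
        return int(rng.integers(n_actions))
    # Exploit: pick the greedy action with respect to the current Q-values.
    return int(np.argmax(Q[state]))

# Example usage with a tabular Q of 70 states and 4 actions.
rng = np.random.default_rng(0)
Q = np.zeros((70, 4))
action = epsilon_greedy_action(Q, state=0, n_actions=4, epsilon=0.1, rng=rng)
```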
Expected Sarsa
Expected Sarsa is very similar to Sarsa. However, instead of stochastically sampling the next action from the current policy, it computes the expected value over all actions in the next state, weighting each action's value by how likely that action is under the current policy. The update step is now defined as:
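$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \sum_{a} \pi(a \mid S') \, Q(S', a) - Q(S, A) \right]$$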
Plugging in this update step leads to the full algorithm for expected Sarsa as provided in Appendix A.
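To make the expectation concrete, a minimal sketch of how the expected value of the next state can be computed under an epsilon-greedy policy is given below. The names (`expected_q`, `Q`, `n_actions`) are ours, and our actual implementation may differ in details.

```python
import numpy as np

def expected_q(Q, next_state, n_actions, epsilon):
    """Expected value of Q(next_state, .) under an epsilon-greedy policy."""
    q_values = Q[next_state]
    greedy_action = int(np.argmax(q_values))
    # Every action receives probability epsilon / n_actions; the greedy action
    # additionally receives the remaining (1 - epsilon) probability mass.
    probs = np.full(n_actions, epsilon / n_actions)
    probs[greedy_action] += 1.0 - epsilon
    return float(np.dot(probs, q_values))
```

This quantity replaces Q(S', A') in the Sarsa update, which is exactly what makes the update independent of which next action happens to be sampled.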
Sarsa vs. Expected Sarsa: the theoretical trade-off
As discussed in the introduction, the goal of this article is to compare the expected Sarsa algorithm with its regular counterpart. We explained the update steps of both methods in the previous sections. Both Sarsa and expected Sarsa can be used as on-policy methods for control. The difference between them is that Sarsa stochastically samples the next action used in its update, while expected Sarsa computes the expected value over the next state's actions. Expected Sarsa is therefore computationally more complex than Sarsa, but in return it eliminates the variance in the update target that Sarsa exhibits due to its stochastic selection of the next action.
This results in a theoretical trade-off between sample efficiency and compute time. On the one hand, expected Sarsa is computationally more complex and therefore slower than Sarsa. On the other hand, expected Sarsa generally outperforms Sarsa when both are given the same amount of experience, as it computes a weighted sum over actions instead of noisily sampling them. This leads to the following hypothesis:
When comparing Sarsa and expected Sarsa, we expect to find a trade-off between sample efficiency and compute-time: Sarsa is more efficient computationally, while expected Sarsa will perform better with less experience.
We further expect to find the following aspects when testing this hypothesis:
- The difference in computational cost between the two methods will increase as the action space of the environment grows. Our reasoning: computing the expected value over actions becomes relatively more expensive in environments with many actions, while the cost of stochastically sampling a single action does not change with the number of actions.
- Larger environment stochasticity causes a smaller performance difference between the two methods [2]. Our reasoning: stochasticity in the environment means that expected Sarsa's update targets are affected by transition noise as well, which limits its performance advantage over regular Sarsa.
- Larger policy stochasticity causes a bigger performance difference between the two methods [2]. Our reasoning: increased randomness in the policy (e.g., a larger epsilon) means that Sarsa samples more 'bad' actions that hinder its performance, whereas expected Sarsa averages over all actions.
In the following sections of this article, we discuss several experiments in a simulated environment to empirically compare both methods and to see whether our theoretical hypothesis holds up in practice.
Methods
Now that we have formulated our hypothesis, we can discuss the methodology of our experiments. In this article, we will run three different experiments on the simulated windy gridworld problem. This is a relatively simple gridworld problem that is easy to implement while still giving meaningful results when comparing different methods, making it well suited for testing our theoretical expectations in practice. A complete overview of the windy gridworld problem is provided in Appendix B.
To thoroughly test our hypothesis, we will compare Sarsa and expected Sarsa in different experimental settings. To make a fair comparison, we use the same algorithmic hyperparameters for both methods in all experiments, unless we specifically state otherwise.
Hyperparameters:
- number of episodes = 1000
- step size (alpha) = 0.5
- discount factor (gamma) = 1.0
- epsilon = 0.1
Throughout all experiments, we will analyze sample efficiency by plotting the average reward against the number of episodes experienced. Moreover, we will analyze differences in computational cost by reporting the average runtime of a single update step and the total duration of running all episodes. To make the results more reliable, we average them over 100 independent runs with different random seeds. Our code to reproduce these experiments has been made publicly available here. In the following sections, we discuss the setup and results of the three experiments one by one.
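As an illustration of how per-update timings like those in table 1 can be obtained, consider the small sketch below. It times a dummy Sarsa-style update on a tabular Q and is not our exact benchmarking code; all names and numbers in it are chosen for illustration only.

```python
import time
import numpy as np

def time_per_update(update_fn, n_repeats=100_000):
    """Average wall-clock time (in seconds) of a single call to update_fn."""
    start = time.perf_counter()
    for _ in range(n_repeats):
        update_fn()
    return (time.perf_counter() - start) / n_repeats

# Dummy Sarsa-style tabular update on fixed indices, purely for illustration.
Q = np.zeros((70, 4))  # 7 x 10 windy gridworld states, 4 actions

def dummy_sarsa_update(s=10, a=1, r=-1.0, s_next=11, a_next=2,
                       alpha=0.5, gamma=1.0):
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

print(f"average time per update: {time_per_update(dummy_sarsa_update):.2e} s")
```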
Experiment 1: deterministic windy gridworld
In our first experiment, we compare the performance of Sarsa and expected Sarsa on the basic version of the windy gridworld problem, with no modifications. Figure 3 plots the average reward against the number of episodes, including the standard deviation across runs. The runtimes are reported in table 1 at the end of the experiments section (together with those of the other experiments).
There are a few interesting things visible here. First of all, expected Sarsa outperforms Sarsa: Sarsa converges relatively slowly and achieves an average reward of around -22, while expected Sarsa converges faster and reaches an average reward of approximately -18. Furthermore, expected Sarsa is noticeably more stable than Sarsa (as indicated by the standard deviation across runs). This is in line with our hypothesis: Sarsa regularly samples noisy actions to update its Q-values, while expected Sarsa always considers all possible actions when updating, which eliminates this source of variance and leads to much more stable results.
If we look at table 1, however, we can see that expected Sarsa is computationally more demanding than Sarsa as the duration per update step increases by 81.9%. Depending on the number of update steps necessary within a reinforcement learning problem, this difference could lead to a considerable increase in computational time.
Experiment 2: stochastic windy gridworld
In the first experiment, we already found support for the trade-off described in our hypothesis. We now move on to our second experiment, in which we make two modifications to the basic setup of the previous experiment. First, we increase the stochasticity of the environment by adding the following rule to the windy gridworld problem: when the agent chooses an action, there is a 20% chance that it ends up in a random state other than the one the chosen action would normally lead to. Second, we increase the stochasticity of our policy by using a higher epsilon value in our epsilon-greedy policy, as a higher epsilon results in more randomness when choosing the next action.
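One common way to implement this kind of transition noise is sketched below: with probability 0.2, the chosen action is replaced by a random one before the regular (deterministic) step is applied. The helper names and the exact form of the noise are ours and may differ from the environment code we actually used.

```python
import numpy as np

def noisy_step(step_fn, state, action, n_actions, rng, noise=0.2):
    """Apply `step_fn`, but with probability `noise` substitute a random action,
    so the agent can end up in a different state than the chosen action intended."""
    if rng.random() < noise:
        action = int(rng.integers(n_actions))
    return step_fn(state, action)
```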
Firstly, if we look at figure 4, we can observe the results of adding stochasticity to the environment. Both Sarsa and expected Sarsa show much noisier rewards over the episodes, as indicated by the high standard deviations. This is expected given the added randomness in the transitions to the next state. Furthermore, the average rewards after convergence hover around -45 for both methods. This result is in line with our hypothesis, which stated that adding stochasticity to the environment limits the performance advantage of expected Sarsa over regular Sarsa.
Secondly, if we look at the computing times in table 1, we again find that expected Sarsa is computationally heavier than regular Sarsa: an update of expected Sarsa in this experiment takes 67% longer. Also, the total runtime of 100 runs of 1000 episodes increases by 50% when we use expected Sarsa.
If we look at figures 5a and 5b, we can observe the results of the increased policy stochasticity experiment. Although both Sarsa and expected Sarsa perform worse with epsilon=0.3 and epsilon=0.5 than with epsilon=0.1 (figure 3), Sarsa deteriorates relatively more than expected Sarsa as epsilon increases. This can be seen from the widening gap between the average rewards of Sarsa and expected Sarsa. This behavior was expected, as increased randomness in the policy means that Sarsa samples more 'bad' actions that hinder its performance, whereas expected Sarsa averages over them. Computationally, expected Sarsa is again less efficient per update (an increase of 96% with epsilon=0.3 and 105% with epsilon=0.5, see table 1).
Experiment 3: increased action-space
In the third experiment, we modify the action space of the windy gridworld environment. In addition to the four regular actions (i.e., left, right, up and down), we add four actions that allow the agent to move diagonally (e.g., up and to the right simultaneously). The results of this experiment are shown in figure 6.
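For reference, the extended action set can be written as (row, column) offsets, as in the sketch below; the coordinate convention (rows grow downward, columns to the right) is our own choice and not part of the original problem description.

```python
# (d_row, d_col) offsets for the regular and the extended action set.
ACTIONS_4 = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
}
ACTIONS_8 = {
    **ACTIONS_4,
    "up-right": (-1, 1), "down-right": (1, 1),
    "down-left": (1, -1), "up-left": (-1, -1),
}
```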
What immediately draws attention is that both Sarsa and expected Sarsa achieve higher rewards compared to experiment 1. This is not surprising, as the extra actions allow both methods to reach the goal in fewer steps.
However, the most interesting finding of this experiment comes from the times per update step of both methods. We expected the computational overhead per update step of expected Sarsa to grow relative to the previous experiments. Nevertheless, table 1 shows that the time per update step of expected Sarsa 'only' increases by 82%, which is comparable to the 81.9% of experiment 1 and well within the 67% to 105% range observed in the previous experiments; the relative overhead did not grow noticeably. We suspect the reason is that adding only four extra actions does not significantly increase the cost of computing the expected action value in the expected Sarsa update. An environment with a much larger action space than windy gridworld might still show the compute time of expected Sarsa growing relative to regular Sarsa as the action space grows.
Conclusion
In this article, we have compared regular Sarsa with expected Sarsa. First, we derived a theoretical trade-off between the two methods: Sarsa is more efficient computationally, while expected Sarsa performs better with less experience. To test this trade-off, we conducted three experiments in the windy gridworld setting. Our experiments empirically support the theoretical trade-off: throughout all experiments, expected Sarsa converged faster and yielded higher average returns, while Sarsa proved to be computationally much more efficient. Depending on the application, this trade-off is important to consider when choosing between these methods.
Moreover, our experiments confirmed two other hypotheses: 1) environments with higher stochasticity diminish the performance advantage of expected Sarsa, while 2) increased stochasticity in our policy increases the performance difference. Lastly, we also expected that environments with an increased action-space would increase the computational advantage Sarsa has over expected Sarsa. Although our experiments could not confirm this hypothesis, we think this difference would become apparent if we move away from simple “toy-like” problems such as windy gridworld towards more complex “real-life” problems in which the action-space increases drastically. Future work could serve to validate this.
References
[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
[2] Van Seijen, H., Van Hasselt, H., Whiteson, S., & Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (pp. 177–184). IEEE.
Appendix A: the full algorithms
Sarsa:
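The classic pseudocode can be found in [1]; below is only a minimal Python sketch of tabular Sarsa with an epsilon-greedy policy. It assumes a simple environment interface with `n_states`, `n_actions`, `reset()` and `step(action)`; these names are our assumption and not part of any specific library.

```python
import numpy as np

def sarsa(env, n_episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Sarsa. Assumes env.reset() -> state and
    env.step(action) -> (next_state, reward, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))

    def policy(state):  # epsilon-greedy over the current Q-values
        if rng.random() < epsilon:
            return int(rng.integers(env.n_actions))
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        state = env.reset()
        action = policy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)
            # Sarsa: bootstrap on the action that will actually be taken next.
            target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```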
Expected Sarsa:
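Analogously, a minimal sketch of tabular expected Sarsa under the same assumed environment interface; only the computation of the bootstrap target differs from the Sarsa sketch above.

```python
import numpy as np

def expected_sarsa(env, n_episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular expected Sarsa (same assumed env interface as the Sarsa sketch)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))

    def policy(state):  # epsilon-greedy over the current Q-values
        if rng.random() < epsilon:
            return int(rng.integers(env.n_actions))
        return int(np.argmax(Q[state]))

    def expected_value(state):
        # Expectation of Q(state, .) under the current epsilon-greedy policy.
        probs = np.full(env.n_actions, epsilon / env.n_actions)
        probs[int(np.argmax(Q[state]))] += 1.0 - epsilon
        return float(probs @ Q[state])

    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Expected Sarsa: bootstrap on the expectation over next actions.
            target = reward + gamma * expected_value(next_state) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```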
Appendix B: Windy Gridworld
Shown below in figure B.1 is the outline of the windy gridworld problem. The goal is to move from the starting cell (S) to the goal cell (G), as in any simple gridworld problem. The four actions are moving right, up, left and down. The problem is an episodic task with a constant reward of -1 for every move taken until the goal is reached. However, there is one extra challenge: a crosswind runs upward through the middle of the grid. The strength of this wind varies per column and is given below each column as an integer value, which denotes the number of cells the agent is pushed upward when it takes an action from that column. For example, if you are one cell to the right of the goal, the action left takes you to the cell just above the goal.
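A minimal sketch of these dynamics is given below. The grid size (7 x 10), wind strengths, and start and goal positions follow the standard formulation of the problem in [1]; the interface and naming are our own and need not match our published code.

```python
# Standard windy gridworld (7 rows x 10 columns), following [1].
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]      # upward push per column
N_ROWS, N_COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Take one action; the wind of the column the agent leaves pushes it upward."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    row = row + d_row - WIND[col]          # move, then shift upward by the wind
    col = col + d_col
    row = min(max(row, 0), N_ROWS - 1)     # clip to stay inside the grid
    col = min(max(col, 0), N_COLS - 1)
    next_state = (row, col)
    return next_state, -1, next_state == GOAL   # reward -1 per move, done at goal

# Example from the text: one cell to the right of the goal, action "left"
# ends up in the cell just above the goal.
print(step((3, 8), "left"))   # ((2, 7), -1, False)
```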