Solutions to the exercises in Sutton & Barto's textbook Reinforcement Learning: An Introduction

Setup

From the root of this repo, run:

```bash
python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements.txt
```

Exercise Discussions

For discussions of specific exercises, see their respective folders. Note that I will add solutions to the exercises before 2.5 soon!

| Chapter | Exercise | Topic | Notes |
| --- | --- | --- | --- |
| 2 | 2.5 | Non-stationary bandit | Random walk on the true action values → noisier, slower convergence |
| 2 | 2.6 | Mysterious Spikes | ↑ initial optimism $\implies$ ↑ initial determinism in the greedy policy |
| 2 | 2.7 | Unbiased Constant-Step-Size Trick | Divide $\alpha$ by a function $f: \mathbb{N} \to (0, 1)$ that tends to 1 from below. This offsets the bias of $Q_1$ by permitting larger-than-$\alpha$ updates at first, converging to $\alpha$-sized updates in the limit (see the sketch after this table). |
| 2 | 2.8 | UCB Spikes | ↑ exploration budget $\implies$ larger initial spike, since the bandit is 'forced off' the action with the highest empirical mean |
| 2 | 2.9 | Gradient Bandits | The 2-action softmax $\equiv$ the sigmoid/logistic function (derivation after this table) |
| 2 | 2.10 | Contextual bandits | With no context, acting randomly is best; with context, associating action-value estimates with states plus exploring leads to higher average reward |
| 2 | 2.11 | Nonstationary bandit parameter study | $\epsilon$-greedy is best here vs. worst on the stationary testbed; optimistic greedy is unaffected by $Q_0$ |
| 3 | 3.1 | Finite MDP Examples | MDPs are very general! |
| 3 | 3.2 | Limits of MDPs | Ill-defined reward signal; very noisy observations of the underlying state; very sparse rewards; highly unstable dynamics |
| 3 | 3.3 | Picking the agent-environment boundary | There is a Pareto front of controllability vs. complexity |
| 3 | 3.4 | Deriving one-step dynamics from state transition probabilities | The dynamics can be read off the state diagram |
| 3 | 3.5 | Episodic Modification for the Transition Distribution | The dynamics function can't condition on terminal states, else the probabilities are undefined |
| 3 | 3.6 | Episodic vs. Continuing Returns of Pole-Balancing | Returns lie on a wider interval under the continuing formulation than the episodic one |
| 3 | 3.7 | The issue with sparse rewards | Sparse rewards inhibit exploration; densifying the reward signal can help |
| 3 | 3.8 | Computing returns from rewards | The recurrence lets us compute earlier returns from later ones (sketch after this table) |
| 3 | 3.9 | Early Returns with Infinite Horizons | Infinite-horizon returns are easy to compute when rewards are constant |
| 3 | 3.10 | Proof of the infinite geometric series identity | The telescoping-sum method works here |
| 3 | 3.11 | Expected Reward in terms of Policy | The product of the policy and the marginalized dynamics function yields the agent's expected reward |
| 3 | 3.12 | State-value function in terms of action-value function | State-value is the expectation of action-value under the policy's action probabilities in that state (see the identities after this table) |
| 3 | 3.13 | Action-value function in terms of state-value function | Action-value is the expected immediate reward plus the discounted expected next-state value |
| 3 | 3.14 | Center state-value in Gridworld | When the state-values of successor states are known, the state-value of a predecessor state follows easily |
| 3 | 3.15 | Adding a constant to all rewards in a continuing task | This adds a constant to all state-values, leaving their relative values unchanged |
| 3 | 3.16 | Adding a constant to all rewards in an episodic task | This shifts state-values by an amount that depends on episode length: positive constants incentivize longer episodes, negative constants shorter ones |
| 3 | 3.17 | Bellman equation for action-values | Action-value is the expected immediate reward plus the discounted expected action-value over successor state-action pairs |
| 3 | 3.18 | State-value in terms of action-value | State-value is the expected action-value |
| 3 | 3.19 | Action-value in terms of state-value | Action-value is the expected immediate reward plus the discounted next-state value |
| 3 | 3.20 | Optimal state-value function for golf | The optimal state-value function looks like the optimal action-value function, except that the optimal action is taken in each state |
| 3 | 3.21 | Optimal action-value function for putting | The optimal action-value is the immediate reward plus the optimal state-value of the state reached after putting |
| 3 | 3.22 | Policy optimality under different discount factors | The discount factor can change which policy is optimal |
| 3 | 3.23 | Optimal action-value equation for recycling robot | We can write a system of equations for optimal action-values in terms of the environment dynamics and other optimal action-values |
| 3 | 3.24 | Optimal state-value from the optimal policy | Knowing the optimal policy reveals the sequence of future rewards from a given state, which determines that state's optimal value |
| 3 | 3.25 | Optimal state-value in terms of optimal action-value | A state's optimal value equals the max of the optimal action-values over the actions available there |
| 3 | 3.26 | Optimal action-value in terms of optimal state-value and dynamics | Optimal action-value is the expected immediate reward plus the discounted optimal value of the successor state, weighted by the transition dynamics |
| 3 | 3.27 | Optimal policy in terms of optimal action-values | The optimal policy is greedy w.r.t. optimal action-values; it picks actions with maximal optimal action-value |
| 3 | 3.28 | Optimal policy in terms of optimal state-values and dynamics | The optimal policy picks actions that maximize the expected next reward plus the discounted optimal state-value |
| 3 | 3.29 | Rewriting Bellman equations | In MDPs, one-step look-ahead is enough for optimal decision-making! |
| 4 | 4.1 | Action-values of a Random Policy in Gridworld | Action-values are easy to compute when state-values are given |
| 4 | 4.2 | Changing one transition in the dynamics | Usually, altering any transition changes all state-values, but not if the alteration swaps state-values that are exactly equal! |
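
A minimal sketch of the step-size schedule behind the Exercise 2.7 note, in plain Python with illustrative names, following the exercise's definition $\beta_n = \alpha / \bar{o}_n$ where $\bar{o}_n = \bar{o}_{n-1} + \alpha(1 - \bar{o}_{n-1})$ and $\bar{o}_0 = 0$:

```python
def unbiased_step_sizes(alpha: float, n_steps: int) -> list[float]:
    """Step sizes beta_n = alpha / o_bar_n, where o_bar_n tends to 1 from below.

    The first step size is beta_1 = alpha / alpha = 1, so the first update
    overwrites Q_1 entirely (removing its bias); later step sizes shrink
    toward the ordinary constant step size alpha.
    """
    betas = []
    o_bar = 0.0
    for _ in range(n_steps):
        o_bar += alpha * (1.0 - o_bar)  # o_bar_n = o_bar_{n-1} + alpha * (1 - o_bar_{n-1})
        betas.append(alpha / o_bar)
    return betas


if __name__ == "__main__":
    print(unbiased_step_sizes(alpha=0.1, n_steps=5))
    # First value is 1.0; subsequent values decay toward alpha = 0.1.
```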
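
And the algebra behind the Exercise 2.9 note: with two actions and preferences $H_t(a_1)$, $H_t(a_2)$, the softmax policy is the logistic function $\sigma$ applied to the preference gap:

```math
\pi_t(a_1) = \frac{e^{H_t(a_1)}}{e^{H_t(a_1)} + e^{H_t(a_2)}}
           = \frac{1}{1 + e^{-(H_t(a_1) - H_t(a_2))}}
           = \sigma\big(H_t(a_1) - H_t(a_2)\big)
```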

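A minimal sketch of the recurrence behind the Exercise 3.8 note, in plain Python with illustrative names; it sweeps backwards over an episode's rewards:

```python
def returns_from_rewards(rewards: list[float], gamma: float) -> list[float]:
    """Compute G_t = R_{t+1} + gamma * G_{t+1} by sweeping backwards.

    rewards[t] holds R_{t+1}, the reward received after step t; the return
    following the final reward is 0 (episodic case).
    """
    g = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns


if __name__ == "__main__":
    print(returns_from_rewards([-1, 2, 6, 3, 2], gamma=0.5))
    # -> [2.0, 6.0, 8.0, 4.0, 2.0]
```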
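
For reference, the identities behind the Exercise 3.12, 3.13, 3.25, and 3.26 notes, in the book's notation:

```math
\begin{aligned}
v_\pi(s)   &= \sum_a \pi(a \mid s)\, q_\pi(s, a) \\
q_\pi(s,a) &= \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big] \\
v_*(s)     &= \max_a q_*(s, a) \\
q_*(s,a)   &= \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_*(s')\big]
\end{aligned}
```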