From the root of this repo, run:

    python3.10 -m venv .venv
    source .venv/bin/activate
    pip install -e .
    pip install -r requirements.txt

For discussions of specific exercises, see their respective folders. Note that I will add solutions for exercises before 2.5 soon!

| Chapter | Exercise | Topic | Notes |
|---|---|---|---|
| 2 | 2.5 | Non-stationary bandit | Random walk on the true values → noisier estimates, slower convergence (see the sketch after the table) |
| 2 | 2.6 | Mysterious Spikes | ↑ initial optimism |
| 2 | 2.7 | Unbiased Constant-Step-Size Trick | Divide the constant step size by a running trace that starts at zero; this removes the initial-value bias (written out after the table) |
| 2 | 2.8 | UCB Spikes | ↑ exploration budget |
| 2 | 2.9 | Gradient Bandits | With two actions, softmax reduces to the logistic (sigmoid) function (shown after the table) |
| 2 | 2.10 | Contextual bandits | With no context, acting randomly is best; with context, associating action-value estimates to states plus exploring leads to higher average reward |
| 2 | 2.11 | Non-stationary bandit parameter study | |
| 3 | 3.1 | Finite MDP Examples | MDPs are very general! |
| 3 | 3.2 | Limits of MDPs | Ill-defined reward signal; very noisy observations of underlying state; very sparse rewards; highly unstable dynamics |
| 3 | 3.3 | Picking the agent-environment boundary | There is a Pareto front of controllability + complexity |
| 3 | 3.4 | Deriving one-step dynamics from state transition probabilities | The dynamics can be read off the state diagram |
| 3 | 3.5 | Episodic Modification for the Transition Distribution | The dynamics function can't condition on terminal states, or else its probabilities are undefined |
| 3 | 3.6 | Episodic vs. Continuing Returns of Pole-Balancing | Returns lie on a wider interval for continuing formulations than episodic |
| 3 | 3.7 | The issue with sparse rewards | Sparse rewards inhibit exploration; densifying the reward signal can help |
| 3 | 3.8 | Computing returns from rewards | The recurrence lets us compute earlier returns from later ones (see the snippet after the table) |
| 3 | 3.9 | Early Returns with Infinite Horizons | Infinite-horizon returns are easily computable when rewards are constant |
| 3 | 3.10 | Proof of the infinite geometric series identity | Telescoping-sum method works here |
| 3 | 3.11 | Expected Reward in terms of Policy | The product of policy and marginalized dynamics function yields the agent's expected rewards |
| 3 | 3.12 | State-value function in terms of action-value function | State-value is the expectation of action-value under the policy's action probabilities in that state (see the identities after the table) |
| 3 | 3.13 | Action-value function in terms of state-value function | Action-value is the expected immediate reward plus the discounted expected next-state value |
| 3 | 3.14 | Center state-value in Gridworld | When state-values of successor states are known it is easy to derive the state-value of a predecessor state from them |
| 3 | 3.15 | Adding a constant to all rewards in a continuing task | This adds a constant to all state-values, leaving their relative values unchanged |
| 3 | 3.16 | Adding a constant to all rewards in an episodic task | This shifts state-values by different amounts depending on remaining episode length. Positive constants incentivize longer episodes; negative constants incentivize shorter ones. |
| 3 | 3.17 | Bellman equation for action-values | Action-value is the expected immediate reward plus the discounted expected action-value over successor state-action pairs |
| 3 | 3.18 | State-value in terms of action-value | State-value is the expected action-value |
| 3 | 3.19 | Action-value in terms of state-value | Action-value is the expected immediate reward plus discounted next state value |
| 3 | 3.20 | Optimal state-value function for golf | The optimal state-value function coincides with the optimal action-value function of whichever action is optimal in each state |
| 3 | 3.21 | Optimal action-value function for putting | The optimal action value is the immediate reward plus the optimal state-value of the resulting state after putting |
| 3 | 3.22 | Policy optimality under different discount factors | Discount factor can change which policy is optimal |
| 3 | 3.23 | Optimal action-value equation for recycling robot | We can write a system of equations for optimal action-values in terms of environment dynamics and other optimal action-values |
| 3 | 3.24 | Optimal state-value from the optimal policy | Knowing the optimal policy can reveal the sequence of future rewards from a given state, which determines the optimal state-value for that state |
| 3 | 3.25 | Optimal state-value in terms of optimal action-value | A state's optimal value equals the max of the optimal action-values for the actions the agent can take there |
| 3 | 3.26 | Optimal action-value in terms of optimal state-value and dynamics | Optimal action-value is the expectation, over successor states weighted by the transition dynamics, of the immediate reward plus the discounted optimal state-value |
| 3 | 3.27 | Optimal policy in terms of optimal action-values | Optimal policy is greedy w.r.t. optimal action-values; it picks actions that yield maximal optimal action-value |
| 3 | 3.28 | Optimal policy in terms of optimal state-values and dynamics | Optimal policy picks actions that yield maximal next-reward plus discounted (optimal) state-value |
| 3 | 3.29 | Rewriting Bellman equations | In MDPs, one-step look-ahead is enough for optimal decision-making! |
| 4 | 4.1 | Action-values of a Random Policy in Gridworld | Action-values are computed easily when state-values are given |
| 4 | 4.2 | Changing one transition in the dynamics | Usually altering any transition changes all state-values. But not if the alteration swaps state-values that are exactly equal! |
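
For Exercise 2.5 (and the 2.11 parameter study), here is a minimal illustrative sketch of the non-stationary testbed, separate from the solution code in the exercise folders: all true action values start equal and take independent Gaussian random walks, and sample-average updates are compared against a constant step size under ε-greedy action selection. The parameters below (10 arms, walk increments with σ = 0.01, ε = α = 0.1, 10,000 steps) are assumptions for illustration.

```python
import numpy as np

def run(method, steps=10_000, k=10, eps=0.1, alpha=0.1, seed=0):
    """Average reward of epsilon-greedy on a k-armed bandit whose true values drift."""
    rng = np.random.default_rng(seed)
    q_true = np.zeros(k)   # true action values, all equal at the start
    q_est = np.zeros(k)    # incremental action-value estimates
    counts = np.zeros(k)
    total = 0.0
    for _ in range(steps):
        # epsilon-greedy action selection
        a = int(rng.integers(k)) if rng.random() < eps else int(np.argmax(q_est))
        r = rng.normal(q_true[a], 1.0)
        counts[a] += 1
        # sample averages use 1/n; a constant step size keeps tracking the drift
        step = 1.0 / counts[a] if method == "sample_average" else alpha
        q_est[a] += step * (r - q_est[a])
        q_true += rng.normal(0.0, 0.01, size=k)  # independent random walks on true values
        total += r
    return total / steps

for m in ("sample_average", "constant_step"):
    print(m, round(run(m), 3))
```

The constant-step-size learner typically earns more reward over the run, since sample averages weight old, stale rewards too heavily on a drifting problem.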
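
The trick in Exercise 2.7, written out in the book's notation: process the n-th reward for a given action with the step size

$$
\beta_n \doteq \frac{\alpha}{\bar{o}_n},
\qquad
\bar{o}_n \doteq \bar{o}_{n-1} + \alpha\,(1 - \bar{o}_{n-1}),
\qquad \bar{o}_0 \doteq 0 .
$$

Since $\beta_1 = 1$, the first update discards the initial estimate entirely, and since $\bar{o}_n \to 1$, the step size behaves like the constant $\alpha$ in the long run.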
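
The identity behind Exercise 2.9: with two actions and preferences $H_1, H_2$, the soft-max probabilities reduce to the logistic (sigmoid) function of the preference difference,

$$
\pi(a_1) = \frac{e^{H_1}}{e^{H_1} + e^{H_2}} = \frac{1}{1 + e^{-(H_1 - H_2)}} .
$$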
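
The recurrence from Exercise 3.8, $G_t = R_{t+1} + \gamma G_{t+1}$, is also how returns get computed in code: sweep the reward sequence backwards from the end of the episode, where $G_T = 0$. A small hypothetical helper, not taken from any exercise folder:

```python
def returns_from_rewards(rewards, gamma):
    """Discounted returns via G_t = R_{t+1} + gamma * G_{t+1}, swept backwards (G_T = 0)."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# Exercise 3.8's numbers: gamma = 0.5 and rewards R_1..R_5 = -1, 2, 6, 3, 2
print(returns_from_rewards([-1, 2, 6, 3, 2], 0.5))  # [2.0, 6.0, 8.0, 4.0, 2.0] = G_0, ..., G_4
```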
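
For quick reference, the identities that Exercises 3.12–3.13 ask for (and, swapping in a max over actions, their optimal counterparts in Exercises 3.25–3.26), in the book's notation:

$$
v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a),
\qquad
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
$$

$$
v_*(s) = \max_a q_*(s, a),
\qquad
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]
$$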