From the root of this repo, run:

    python3.10 -m venv .venv
    source .venv/bin/activate
    pip install -e .
    pip install -r requirements.txt

For discussions of specific exercises, see their respective folders. Note that I will add solutions for exercises before 2.5 soon!

| Chapter | Exercise | Topic | Notes |
|---|---|---|---|
| 2 | 2.5 | Non-stationary bandit | Random walk on the true values → noisier estimates, slower convergence (see the sketch after the table) |
| 2 | 2.6 | Mysterious Spikes | ↑ initial optimism |
| 2 | 2.7 | Unbiased Constant-Step-Size Trick | Divide the constant step size by a running trace that starts at zero; this removes the initial-value bias (written out after the table) |
| 2 | 2.8 | UCB Spikes | ↑ exploration budget |
| 2 | 2.9 | Gradient Bandits | With two actions, softmax reduces to the logistic (sigmoid) function (shown after the table) |
| 2 | 2.10 | Contextual bandits | With no context, acting randomly is best; with context, associating action-value estimates to states plus exploring leads to higher average reward |
| 2 | 2.11 | Non-stationary bandit parameter study | |
| 3 | 3.1 | Finite MDP Examples | MDPs are very general! |
| 3 | 3.2 | Limits of MDPs | Ill-defined reward signal; very noisy observations of underlying state; very sparse rewards; highly unstable dynamics |
| 3 | 3.3 | Picking the agent-environment boundary | There is a Pareto front of controllability + complexity |
| 3 | 3.4 | Deriving one-step dynamics from state transition probabilities | The dynamics can be read off the state diagram |
| 3 | 3.5 | Episodic Modification for the Transition Distribution | The dynamics function can't condition on terminal states, or else its probabilities are undefined |
| 3 | 3.6 | Episodic vs. Continuing Returns of Pole-Balancing | Returns lie on a wider interval for continuing formulations than episodic |
| 3 | 3.7 | The issue with sparse rewards | Sparse rewards inhibit exploration; densifying the reward signal can help |
| 3 | 3.8 | Computing returns from rewards | The recurrence lets us compute earlier returns from later ones (see the snippet after the table) |
| 3 | 3.9 | Early Returns with Infinite Horizons | Infinite-horizon returns are easily computable when rewards are constant |
| 3 | 3.10 | Proof of the infinite geometric series identity | Telescoping-sum method works here |
| 3 | 3.11 | Expected Reward in terms of Policy | The product of policy and marginalized dynamics function yields the agent's expected rewards |
| 3 | 3.12 | State-value function in terms of action-value function | State-value is the expectation of action-value under the policy's action probabilities in that state (see the identities after the table) |
| 3 | 3.13 | Action-value function in terms of state-value function | Action-value is the expected immediate reward plus the discounted expected next-state value |
| 3 | 3.14 | Center state-value in Gridworld | When state-values of successor states are known it is easy to derive the state-value of a predecessor state from them |
| 3 | 3.15 | Adding a constant to all rewards in a continuing task | This adds a constant to all state-values, leaving their relative values unchanged |
| 3 | 3.16 | Adding a constant to all rewards in an episodic task | This shifts state-values by different amounts depending on remaining episode length. Positive constants incentivize longer episodes; negative constants incentivize shorter ones. |
| 3 | 3.17 | Bellman equation for action-values | Action-value is the expected immediate reward plus the discounted expected action-value over successor state-action pairs |
| 3 | 3.18 | State-value in terms of action-value | State-value is the expected action-value |
| 3 | 3.19 | Action-value in terms of state-value | Action-value is the expected immediate reward plus discounted next state value |
| 3 | 3.20 | Optimal state-value function for golf | The optimal state-value function coincides with the optimal action-value function of whichever action is optimal in each state |
| 3 | 3.21 | Optimal action-value function for putting | The optimal action value is the immediate reward plus the optimal state-value of the resulting state after putting |
| 3 | 3.22 | Policy optimality under different discount factors | Discount factor can change which policy is optimal |
| 3 | 3.23 | Optimal action-value equation for recycling robot | We can write a system of equations for optimal action-values in terms of environment dynamics and other optimal action-values |
| 3 | 3.24 | Optimal state-value from the optimal policy | Knowing the optimal policy can reveal the sequence of future rewards from a given state, which determines the optimal state-value for that state |
| 3 | 3.25 | Optimal state-value in terms of optimal action-value | A state's optimal value equals the max of the optimal action-values for the actions the agent can take there |
| 3 | 3.26 | Optimal action-value in terms of optimal state-value and dynamics | Optimal action-value is the expectation, over successor states weighted by the transition dynamics, of the immediate reward plus the discounted optimal state-value |
| 3 | 3.27 | Optimal policy in terms of optimal action-values | Optimal policy is greedy w.r.t. optimal action-values; it picks actions that yield maximal optimal action-value |
| 3 | 3.28 | Optimal policy in terms of optimal state-values and dynamics | Optimal policy picks actions that yield maximal next-reward plus discounted (optimal) state-value |
| 3 | 3.29 | Rewriting Bellman equations | In MDPs, one-step look-ahead is enough for optimal decision-making! |
| 4 | 4.1 | Action-values of a Random Policy in Gridworld | Action-values are computed easily when state-values are given |
| 4 | 4.2 | Changing one transition in the dynamics | Usually altering any transition changes all state-values. But not if the alteration swaps state-values that are exactly equal! |
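
For Exercise 2.5 (and the 2.11 parameter study), here is a minimal illustrative sketch of the non-stationary testbed, separate from the solution code in the exercise folders: all true action values start equal and take independent Gaussian random walks, and sample-average updates are compared against a constant step size under ε-greedy action selection. The parameters below (10 arms, walk increments with σ = 0.01, ε = α = 0.1, 10,000 steps) are assumptions for illustration.

```python
import numpy as np

def run(method, steps=10_000, k=10, eps=0.1, alpha=0.1, seed=0):
    """Average reward of epsilon-greedy on a k-armed bandit whose true values drift."""
    rng = np.random.default_rng(seed)
    q_true = np.zeros(k)   # true action values, all equal at the start
    q_est = np.zeros(k)    # incremental action-value estimates
    counts = np.zeros(k)
    total = 0.0
    for _ in range(steps):
        # epsilon-greedy action selection
        a = int(rng.integers(k)) if rng.random() < eps else int(np.argmax(q_est))
        r = rng.normal(q_true[a], 1.0)
        counts[a] += 1
        # sample averages use 1/n; a constant step size keeps tracking the drift
        step = 1.0 / counts[a] if method == "sample_average" else alpha
        q_est[a] += step * (r - q_est[a])
        q_true += rng.normal(0.0, 0.01, size=k)  # independent random walks on true values
        total += r
    return total / steps

for m in ("sample_average", "constant_step"):
    print(m, round(run(m), 3))
```

The constant-step-size learner typically earns more reward over the run, since sample averages weight old, stale rewards too heavily on a drifting problem.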
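
The trick in Exercise 2.7, written out in the book's notation: process the n-th reward for a given action with the step size

$$
\beta_n \doteq \frac{\alpha}{\bar{o}_n},
\qquad
\bar{o}_n \doteq \bar{o}_{n-1} + \alpha\,(1 - \bar{o}_{n-1}),
\qquad \bar{o}_0 \doteq 0 .
$$

Since $\beta_1 = 1$, the first update discards the initial estimate entirely, and since $\bar{o}_n \to 1$, the step size behaves like the constant $\alpha$ in the long run.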
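
The identity behind Exercise 2.9: with two actions and preferences $H_1, H_2$, the soft-max probabilities reduce to the logistic (sigmoid) function of the preference difference,

$$
\pi(a_1) = \frac{e^{H_1}}{e^{H_1} + e^{H_2}} = \frac{1}{1 + e^{-(H_1 - H_2)}} .
$$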
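
The recurrence from Exercise 3.8, $G_t = R_{t+1} + \gamma G_{t+1}$, is also how returns get computed in code: sweep the reward sequence backwards from the end of the episode, where $G_T = 0$. A small hypothetical helper, not taken from any exercise folder:

```python
def returns_from_rewards(rewards, gamma):
    """Discounted returns via G_t = R_{t+1} + gamma * G_{t+1}, swept backwards (G_T = 0)."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# Exercise 3.8's numbers: gamma = 0.5 and rewards R_1..R_5 = -1, 2, 6, 3, 2
print(returns_from_rewards([-1, 2, 6, 3, 2], 0.5))  # [2.0, 6.0, 8.0, 4.0, 2.0] = G_0, ..., G_4
```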
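
For quick reference, the identities that Exercises 3.12–3.13 ask for (and, swapping in a max over actions, their optimal counterparts in Exercises 3.25–3.26), in the book's notation:

$$
v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a),
\qquad
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
$$

$$
v_*(s) = \max_a q_*(s, a),
\qquad
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]
$$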