SuttonRL-Implementation
An interactive Rustlings-style platform for implementing the major reinforcement learning algorithms from Sutton & Barto's textbook: 13 chapters, 18K+ lines of Python, with auto-grading and watch mode.
Project Overview
SuttonRL-Implementation is a Rustlings-style interactive learning platform for implementing every major reinforcement learning algorithm presented in Reinforcement Learning: An Introduction (2nd Edition) by Richard S. Sutton and Andrew G. Barto. The project contains 18,000+ lines of Python across 13 chapters and provides an interactive CLI runner with auto-grading, watch mode, and progressive hints. It is intended as a hands-on companion to the canonical RL textbook.
Situation & Motivation
Reinforcement learning is notoriously difficult to learn from textbooks alone. The Sutton & Barto book is the field's gold standard, but its pseudocode leaves a significant gap between understanding an algorithm conceptually and being able to implement it correctly. Existing reference implementations typically provide complete, runnable code, which is useful for study but removes the productive struggle that builds deep understanding. What was missing was a project offering structured, auto-graded, fill-in-the-blank exercises aligned chapter by chapter with the textbook.
Task & Design Goals
Build a chapter-aligned exercise platform covering every core algorithm in the Sutton & Barto book, from multi-armed bandits (Chapter 2) through policy gradient methods (Chapter 13). Each exercise should expose only the skeleton of the algorithm, require the learner to implement the critical parts, and validate the result in two ways: logically (the code behaves as the algorithm specifies) and numerically (performance metrics must fall within predefined acceptable ranges).
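The numerical half of that validation can be pictured as a simple range check. The following is a minimal sketch, not the project's grader: the names `MetricTarget` and `grade` are hypothetical, and the only assumption taken from the description is that each metric has a target value and an allowed deviation (for instance, an average reward of 1.3 ± 0.2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricTarget:
    """Hypothetical description of one numerical acceptance criterion."""
    name: str
    target: float
    tolerance: float

    def accepts(self, value: float) -> bool:
        # Pass when the value lies within [target - tolerance, target + tolerance].
        return abs(value - self.target) <= self.tolerance


def grade(metrics: dict[str, float], targets: list[MetricTarget]) -> list[str]:
    """Return failure messages; an empty list means every target was met."""
    failures = []
    for t in targets:
        if t.name not in metrics:
            failures.append(f"{t.name}: metric missing")
        elif not t.accepts(metrics[t.name]):
            failures.append(
                f"{t.name}: got {metrics[t.name]:.3f}, "
                f"expected {t.target} ± {t.tolerance}"
            )
    return failures
```

Returning a list of messages rather than a single boolean lets a runner report every failed metric in one pass, which suits the immediate-feedback style the platform aims for.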
Action & Implementation
The platform organises exercises into eight groups, each corresponding to one or two textbook chapters:
- Ch02 — Multi-armed Bandits: ε-greedy, UCB, gradient bandit, optimistic initialisation
- Ch04 — Dynamic Programming: policy evaluation, policy iteration, value iteration, gambler's problem
- Ch05 — Monte Carlo Methods: first-visit MC, MC with ES, off-policy MC, importance sampling
- Ch06 — Temporal-Difference Learning: TD(0), SARSA, Q-Learning, Expected SARSA
- Ch07 — n-step Bootstrapping: n-step TD, n-step SARSA, tree-backup algorithm
- Ch09–10 — Function Approximation: gradient MC, semi-gradient TD, episodic and differential SARSA
- Ch12 — Eligibility Traces: TD(λ), SARSA(λ), True Online TD(λ)
- Ch13 — Policy Gradient: REINFORCE, REINFORCE with baseline, Actor-Critic
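To give a sense of what a finished exercise from the first group might compute, here is a minimal sketch of ε-greedy action selection with incremental sample-average updates on a Gaussian bandit. It is an illustration under stated assumptions, not the project's reference solution: the function name `run_epsilon_greedy`, its parameters, and the unit-variance reward model are all choices made for this example, and ties in the greedy step go to the lowest index rather than being broken at random.

```python
import numpy as np


def run_epsilon_greedy(true_means, epsilon=0.1, steps=2000, seed=0):
    """Run epsilon-greedy on a Gaussian bandit (illustrative sketch).

    Returns (Q, N, average_reward) where Q holds the value estimates
    and N the number of times each arm was pulled.
    """
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    k = len(true_means)
    Q = np.zeros(k)              # action-value estimates, initialised to 0
    N = np.zeros(k, dtype=int)   # pull counts per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(k))   # explore: uniform random arm
        else:
            action = int(np.argmax(Q))      # exploit: ties go to lowest index
        # Assumed reward model: Gaussian with unit variance around the true mean.
        reward = rng.normal(true_means[action], 1.0)
        N[action] += 1
        # Incremental sample-average update: Q <- Q + (R - Q) / N
        Q[action] += (reward - Q[action]) / N[action]
        total_reward += reward

    return Q, N, total_reward / steps
```

A call such as `run_epsilon_greedy([0.0, 0.5, 2.0])` returns the estimates, the pull counts, and the average reward, which is the kind of metric a grader could compare against a target range.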
Every exercise file contains a rich docstring with the algorithm description, mathematical
requirements, performance targets (e.g. “average reward must reach 1.3 ± 0.2”),
and a textbook reference. The learner fills in # TODO sections, then runs
python -m sutton_rl run <exercise> to receive immediate pass/fail feedback
alongside numerical performance metrics. A watch mode mirrors the Rustlings experience,
re-running validation on every file save.
exercises/
├── ch02_bandits/
│   ├── ex01_epsilon_greedy.py
│   ├── ex02_ucb.py
│   ├── ex03_gradient_bandit.py
│   └── ex04_optimistic_initial.py
├── ch04_dp/ ...
├── ch06_td/ ...
├── ch12_eligibility/ ...
└── ch13_policy_gradient/ ...
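The watch mode described above can be approximated with nothing more than modification-time polling. The sketch below is one possible approach using only the standard library, not a description of how the project implements it: `poll_once` and `watch` are hypothetical names, and `on_change` stands in for whatever re-runs validation on the exercise file.

```python
import os
import time


def poll_once(path, last_mtime_ns):
    """Return (changed, current_mtime_ns) for the file at `path`."""
    current = os.stat(path).st_mtime_ns
    return current != last_mtime_ns, current


def watch(path, on_change, interval=0.5, max_polls=None):
    """Call on_change(path) each time the file's modification time changes.

    `max_polls` bounds the loop so the sketch can be exercised in tests;
    None means keep polling until interrupted.
    """
    # Record a baseline first so the initial state does not count as a change.
    _, last = poll_once(path, None)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        changed, last = poll_once(path, last)
        if changed:
            on_change(path)   # e.g. re-run the grader for this exercise
        polls += 1
```

Polling is easy to reason about and portable. A tool that needs lower latency or watches many files would more likely rely on operating-system file notifications.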
Results & Impact
The completed platform covers 13 chapters with 18,000+ lines of Python, making it one of the more comprehensive hands-on RL learning resources aligned with the Sutton & Barto textbook. The auto-grading system checks not only that the code is correct but also that each algorithm reaches its performance target, so learners have evidence that they have mastered a concept before advancing. The structured beginner path (Ch02 → Ch04 → Ch06 → Ch05 → Ch07) and advanced path (Ch09–10 → Ch12 → Ch13) give clear learning progressions for different skill levels.