COMPLETED Tools

SuttonRL-Implementation

An interactive Rustlings-style platform for implementing the major reinforcement learning algorithms from Sutton & Barto's textbook: 13 chapters, 18K+ lines of Python, with auto-grading and watch mode.

status COMPLETED

type Tools

stack Python, NumPy, matplotlib, Click, pytest, Reinforcement Learning

// DESCRIPTION

Project Overview

SuttonRL-Implementation is a Rustlings-style interactive learning platform for implementing every major reinforcement learning algorithm presented in Reinforcement Learning: An Introduction (2nd Edition) by Richard S. Sutton and Andrew G. Barto. The project contains 18,000+ lines of Python across 13 chapters and provides an interactive CLI runner with auto-grading, watch mode, and progressive hints, so it can serve as a hands-on companion to the textbook.

Situation & Motivation

Reinforcement learning is notoriously difficult to learn from textbooks alone. The Sutton & Barto book is the field's gold standard, but its pseudocode descriptions leave a significant gap between understanding an algorithm conceptually and being able to implement it correctly. Existing reference implementations typically provide complete, runnable code — which is useful for study but eliminates the productive struggle that builds deep understanding. There was no project that provided structured, auto-graded, fill-in-the-blank exercises aligned chapter-by-chapter with the textbook.

Task & Design Goals

Build a chapter-aligned exercise platform covering every core algorithm in the Sutton & Barto book, from multi-armed bandits (Chapter 2) through policy gradient methods (Chapter 13). Each exercise should expose only the skeleton of the algorithm, require the learner to implement the critical parts, and validate correctness both logically and numerically (performance metrics must fall within predefined acceptable ranges).
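
As a rough illustration of the numerical half of that validation, the sketch below checks a measured metric against an accepted range. The names PerformanceTarget and grade are hypothetical and are not taken from the project; the 1.3 ± 0.2 figures in the usage note only mirror the kind of target the exercises describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceTarget:
    """Hypothetical description of the accepted range for one metric."""
    name: str
    expected: float
    tolerance: float

    def accepts(self, measured: float) -> bool:
        # Pass when the measured value lies within expected ± tolerance.
        return abs(measured - self.expected) <= self.tolerance


def grade(measured: float, target: PerformanceTarget) -> str:
    """Return a one-line pass/fail report for a single metric."""
    verdict = "PASS" if target.accepts(measured) else "FAIL"
    low = target.expected - target.tolerance
    high = target.expected + target.tolerance
    return (f"{verdict}: {target.name} = {measured:.3f} "
            f"(accepted range {low:.2f} to {high:.2f})")
```

With a target of PerformanceTarget("average reward", 1.3, 0.2), a measured value of 1.25 would pass and 0.9 would fail. A range check like this catches implementations that run without errors but learn poorly, which a purely logical test can miss.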

Action & Implementation

The platform organises exercises into eight groups that map directly to textbook chapters (Chapters 9 and 10 share one group):

Ch02 — Multi-armed Bandits: ε-greedy, UCB, gradient bandit, optimistic initialisation
Ch04 — Dynamic Programming: policy evaluation, policy iteration, value iteration, gambler's problem
Ch05 — Monte Carlo Methods: first-visit MC, MC with ES, off-policy MC, importance sampling
Ch06 — Temporal-Difference Learning: TD(0), SARSA, Q-Learning, Expected SARSA
Ch07 — n-step Bootstrapping: n-step TD, n-step SARSA, tree-backup algorithm
Ch09–10 — Function Approximation: gradient MC, semi-gradient TD, episodic and differential SARSA
Ch12 — Eligibility Traces: TD(λ), SARSA(λ), True Online TD(λ)
Ch13 — Policy Gradient: REINFORCE, REINFORCE with baseline, Actor-Critic
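
To make one entry in the list above concrete, here is a minimal sketch of the tabular Q-learning update covered in Chapter 6. The function name, argument order, and default step size are illustrative choices, not the project's actual exercise interface, and the Q-table is a plain list of lists to keep the example free of dependencies.

```python
def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=1.0):
    """Apply one tabular Q-learning update in place and return the TD error.

    q is a list of lists indexed as q[state][action].
    """
    # Bootstrap from the greedy value of the next state;
    # a terminal transition contributes no future value.
    bootstrap = 0.0 if done else gamma * max(q[next_state])
    td_error = reward + bootstrap - q[state][action]
    q[state][action] += alpha * td_error
    return td_error
```

SARSA differs only in the bootstrap term: it uses the value of the action actually selected in the next state rather than the maximum, which is what makes it on-policy.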

Every exercise file contains a rich docstring with the algorithm description, mathematical requirements, performance targets (e.g. “average reward must reach 1.3 ± 0.2”), and a textbook reference. The learner fills in # TODO sections, then runs python -m sutton_rl run <exercise> to receive immediate pass/fail feedback alongside numerical performance metrics. A watch mode mirrors the Rustlings experience, re-running validation on every file save.
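
The watch mode can be approximated with a simple polling loop over file modification times. This sketch uses only the standard library; the function names and the validate callback are hypothetical, and the project itself may rely on a dedicated file-system monitoring library instead.

```python
import time
from pathlib import Path


def snapshot(directory):
    """Map every .py file under directory to its modification time in nanoseconds."""
    return {path: path.stat().st_mtime_ns for path in Path(directory).rglob("*.py")}


def changed_files(before, after):
    """Return files that are new or whose modification time differs, sorted by path."""
    return sorted(path for path, mtime in after.items() if before.get(path) != mtime)


def watch(directory, validate, interval=0.5):
    """Poll directory forever, calling validate(path) for each changed file."""
    previous = snapshot(directory)
    while True:
        time.sleep(interval)
        current = snapshot(directory)
        for path in changed_files(previous, current):
            validate(path)
        previous = current
```

Polling is less efficient than operating-system change notifications, but it needs no extra dependency and behaves the same on every platform, which is often good enough for a few dozen exercise files.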

exercises/
├── ch02_bandits/
│   ├── ex01_epsilon_greedy.py
│   ├── ex02_ucb.py
│   ├── ex03_gradient_bandit.py
│   └── ex04_optimistic_initial.py
├── ch04_dp/ ...
├── ch06_td/ ...
├── ch12_eligibility/ ...
└── ch13_policy_gradient/ ...
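
A file such as ex01_epsilon_greedy.py might look roughly like the sketch below once completed. The docstring layout, function signature, and stationary Gaussian testbed are assumptions made for illustration; in a real exercise the two marked steps would be left blank for the learner.

```python
"""Exercise sketch: epsilon-greedy action selection (Sutton & Barto, Chapter 2).

Hypothetical layout. Implement the two TODO steps so that action-value
estimates track the sample average of observed rewards.
"""
import random


def run_epsilon_greedy(true_means, steps=1000, epsilon=0.1, seed=0):
    """Run one bandit task and return (estimates, pull counts, average reward)."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k
    counts = [0] * k
    total_reward = 0.0

    for _ in range(steps):
        # TODO 1 (filled in here): explore with probability epsilon, else act greedily.
        if rng.random() < epsilon:
            action = rng.randrange(k)
        else:
            action = max(range(k), key=lambda a: estimates[a])

        # Reward is drawn from a unit-variance Gaussian around the arm's true mean.
        reward = rng.gauss(true_means[action], 1.0)

        # TODO 2 (filled in here): incremental sample-average update.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return estimates, counts, total_reward / steps
```

Fixing the seed makes a run reproducible, which matters when a grader compares the average reward against a target range.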

Results & Impact

The completed platform covers 13 chapters with 18,000+ lines of Python, aligned chapter by chapter with the Sutton & Barto textbook. The auto-grading system checks algorithmic performance as well as code correctness, so an exercise passes only when the implementation both runs and meets its performance target. The structured beginner path (Ch02 → Ch04 → Ch06 → Ch05 → Ch07) and advanced path (Ch09–10 → Ch12 → Ch13) give clear learning progressions for different skill levels.

// HIGHLIGHTS

Full chapter-by-chapter coverage of the Sutton & Barto RL textbook (2nd Ed.), Chapters 2 through 13
18,000+ lines of Python across bandit algorithms, dynamic programming, Monte Carlo, TD learning, and policy gradient methods
Fill-in-the-blank exercise format: algorithm skeletons with TODO markers and embedded performance targets
Auto-grading validates both logical correctness and numerical performance (e.g. average reward within ±0.2 of the target)
Rustlings-style watch mode: file-system monitoring with instant re-validation on save
Progressive hints system per exercise with configurable detail levels
Structured learning paths for beginners (bandits → DP → TD) and advanced learners (function approximation → eligibility traces → policy gradient)
Includes TD(λ), SARSA(λ), True Online TD(λ), Actor-Critic, and REINFORCE with baseline
