Reinforcement Learning

Reinforcement learning (RL) is a branch of machine learning concerned with how intelligent agents ought to take actions in an environment to maximize cumulative reward. Unlike supervised learning, which relies on labeled datasets, and unsupervised learning, which focuses on finding hidden patterns, RL learns through trial-and-error interaction.

The paradigm draws inspiration from behavioral psychology and optimal control theory. An agent observes the state of an environment, selects an action, receives a scalar reward signal, and transitions to a new state. Over time, the agent learns a policy—a mapping from states to actions—that maximizes the expected sum of future rewards.

Key Distinction

Unlike supervised learning where feedback is immediate and correct, RL feedback is delayed, sparse, and often stochastic. The agent must balance exploration (trying new actions) with exploitation (using known good actions).

Mathematical Framework

The standard formalism for reinforcement learning is the Markov Decision Process (MDP), defined by a tuple $(S, A, P, R, \gamma)$:

$S$: A finite set of states
$A$: A finite set of actions
$P(s'|s, a)$: Transition probability of reaching state $s'$ given state $s$ and action $a$
$R(s, a, s')$: Reward function
$\gamma \in [0, 1]$: Discount factor determining the present value of future rewards

The objective is to learn a policy $\pi(a|s)$ that maximizes the expected discounted return:

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

The Bellman equations provide recursive relationships for value functions, forming the theoretical foundation for most RL algorithms.

Core Concepts

1. Policy ($\pi$)

A policy defines the agent's behavior. It can be deterministic ($\pi: S \rightarrow A$) or stochastic ($\pi: S \rightarrow \Delta(A)$). The optimal policy $\pi^*$ maximizes the expected return from every state.

2. Value Functions

Value functions evaluate how good it is to be in a state or to take an action from a state:

State-value function: $V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$
Action-value function: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]$

3. Environment Model

Some algorithms learn or use an explicit model of the environment's dynamics $P$ and $R$. Others are model-free, learning directly from experience without representing the environment explicitly.

Major Algorithms

Value-Based Methods

These algorithms focus on estimating the action-value function $Q(s,a)$ and deriving the policy implicitly by selecting actions with the highest Q-value.

Q-Learning: Off-policy algorithm that learns the optimal Q-function regardless of the agent's behavior.
SARSA: On-policy variant that updates Q-values based on the actual action taken by the current policy.
Deep Q-Networks (DQN): Combines Q-learning with deep neural networks and experience replay to handle high-dimensional state spaces (e.g., raw pixels).

Policy-Based Methods

Directly optimize the policy parameters using gradient ascent on expected return.

REINFORCE: Monte Carlo policy gradient method.
Proximal Policy Optimization (PPO): Modern, stable on-policy algorithm that uses clipped surrogate objectives to prevent destructive policy updates.

Actor-Critic Methods

Combine value-based and policy-based approaches. The actor updates the policy, while the critic evaluates the policy by estimating the value function. This reduces variance compared to pure policy gradients and speeds up convergence compared to pure value methods.

# Simplified PPO Update Step (Pseudocode)
for epoch in range(update_epochs):
    advantage = compute_advantage(states, rewards, dones)
    ratios = torch.exp(torch.log(new_probs) - old_probs)
    surr1 = ratios * advantage
    surr2 = torch.clamp(ratios, 1-clip_epsilon, 1+clip_epsilon) * advantage
    policy_loss = -torch.min(surr1, surr2).mean()

Real-World Applications

Game AI: AlphaGo, AlphaZero, and Libratus achieved superhuman performance in Go, chess, shogi, and poker.
Robotics: Sim-to-real transfer enables robots to learn locomotion, manipulation, and navigation through reinforcement learning in simulation.
Recommendation Systems: YouTube and Netflix use RL to optimize long-term user engagement rather than immediate clicks.
Autonomous Vehicles: Decision-making modules use RL for lane changes, intersection navigation, and merging.
Healthcare: Dynamic treatment regimens and dosage optimization for chronic diseases.
Finance: Algorithmic trading, portfolio management, and risk assessment.

Challenges & Open Problems

Sample Inefficiency: Many RL algorithms require millions of interactions, making them impractical for real-world systems where data collection is expensive or dangerous.
Sparse & Delayed Rewards: Environments where feedback is rare (e.g., winning a game only at the end) severely hinder learning progress.
Exploration vs. Exploitation: Finding optimal strategies to explore unknown states without wasting excessive time on suboptimal actions remains mathematically challenging.
Safety & Alignment: Agents may learn reward-hacking behaviors that maximize the reward signal but violate intended constraints or ethical guidelines.
Transfer & Generalization: Policies trained in one environment often fail to generalize to slightly different settings (distribution shift).

Current research focuses on offline RL (learning from static datasets), model-based RL (improving sample efficiency), multi-agent RL, and safe RL to address these limitations.

References & Further Reading

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
Silver, D., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140-1144.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237-285.