Markov Decision Process (MDP)

Markov decision processes formally describe an environment for reinforcement learning where the environment is fully observable.

i.e. the current state completely characterizes the process.
Almost all reinforcement learning problems can be cast as MDPs.
Optimal control primarily deals with continuous MDPs.
Bandits are MDPs with one state.

The future is independent of the past given the present.

Markov property

A state $S_{t}$ is Markov if and only if $P [S_{t + 1} ∣ S_{t}] = P [S_{t + 1} ∣ S_{1}, \dots, S_{t}]$, i.e. the current state captures all relevant information from the history.

$P [\cdot ∣ \cdot]$ denotes a conditional probability.

i.e. the state is a sufficient statistic of the future.
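
As a minimal sketch of the Markov property, the check below builds the exact joint distribution of $(S_{1}, S_{2}, S_{3})$ for a small chain and compares $P [S_{3} ∣ S_{2}]$ with $P [S_{3} ∣ S_{1}, S_{2}]$. The 3-state transition matrix `P`, the distribution `mu` of $S_{1}$, and the helper `markov_gap` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 3-state chain: row i holds P[S_{t+1} = j | S_t = i].
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
])
mu = np.array([0.5, 0.3, 0.2])  # assumed distribution of S_1

# Joint distribution: joint[i, j, k] = P[S_1 = i, S_2 = j, S_3 = k].
joint = mu[:, None, None] * P[:, :, None] * P[None, :, :]

# Condition on the full history (S_1, S_2).
p_hist = joint.sum(axis=2, keepdims=True)            # P[S_1 = i, S_2 = j]
cond_on_history = joint / np.where(p_hist > 0, p_hist, 1.0)

# Condition on the current state S_2 only.
joint_23 = joint.sum(axis=0)                          # P[S_2 = j, S_3 = k]
cond_on_current = joint_23 / joint_23.sum(axis=1, keepdims=True)


def markov_gap():
    """Largest gap between the two conditionals over histories with nonzero probability."""
    reachable = p_hist[:, :, 0] > 0
    diff = np.abs(cond_on_history - cond_on_current[None, :, :])
    return float(diff[reachable].max())


print(markov_gap())  # zero up to floating-point error
```

For every history with nonzero probability the two conditionals agree, so knowing $S_{1}$ adds nothing once $S_{2}$ is known.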

Definition

A Markov Decision Process is a tuple $(S, A, T, R, S_{0}, γ, H)$, where:

$S$ is the set of all possible states

$A$ is the set of all possible actions

$T$ is the transition function $p (s^{'} ∣ s, a)$, the probability of landing in the next state $s^{'}$ given the current state $s$ and the selected action $a$

$R$ is the reward function $r : S \times A \to \mathbb{R}$, giving the expected reward of a transition starting at $(s, a)$: $r (s, a) = E [R ∣ s, a]$

$S_{0}$ is the initial state distribution

$γ$ is the discount factor

$H$ is the planning horizon
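
For a finite problem, the tuple above can be stored directly as arrays. The following is a minimal sketch, not a reference implementation: the class name `TabularMDP`, the integer indexing of states and actions, the restriction of the discount factor to $[0, 1]$, and the two-state example numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TabularMDP:
    """Finite MDP (S, A, T, R, S_0, gamma, H); states and actions are integer indices."""
    T: np.ndarray    # T[s, a, s'] = p(s' | s, a), shape (|S|, |A|, |S|)
    R: np.ndarray    # R[s, a, s'] = reward for that transition, same shape as T
    S0: np.ndarray   # initial state distribution, shape (|S|,)
    gamma: float     # discount factor
    H: int           # planning horizon

    def __post_init__(self):
        # Each T[s, a, :] and S0 must be a probability distribution.
        assert np.allclose(self.T.sum(axis=2), 1.0)
        assert np.isclose(self.S0.sum(), 1.0)
        assert 0.0 <= self.gamma <= 1.0

    def expected_reward(self):
        """r(s, a) = E[R | s, a] = sum over s' of p(s' | s, a) * R[s, a, s']."""
        return (self.T * self.R).sum(axis=2)


# Hypothetical two-state, two-action example.
T = np.array([
    [[1.0, 0.0], [0.2, 0.8]],   # from state 0: action 0 stays, action 1 usually moves
    [[0.0, 1.0], [0.9, 0.1]],   # from state 1: action 0 stays, action 1 usually moves
])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                # reward 1 whenever the next state is state 1
mdp = TabularMDP(T=T, R=R, S0=np.array([1.0, 0.0]), gamma=0.9, H=10)

print(mdp.expected_reward())    # rows are states, columns are actions
```

Here `expected_reward` averages the per-transition reward over the next state, which is exactly the expectation $E [R ∣ s, a]$ in the definition of the reward function.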

An MDP is considered “solved” once we find a policy $π$ that maximizes the expected discounted return:

$\max_{π} E \left[ \sum_{t = 0}^{H} γ^{t} R (S_{t}, A_{t}, S_{t + 1}) ∣ π \right]$

i.e. the expected sum of discounted rewards obtained by starting from a state $s$ and acting optimally thereafter.
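
The notes do not say how to find such a policy. One standard option for small tabular problems with a finite horizon is backward induction (dynamic programming), sketched below under the assumption that $T$ and $R$ are available as arrays indexed `[s, a, s']`. The function name `backward_induction` and the deterministic two-state example are hypothetical.

```python
import numpy as np


def backward_induction(T, R, gamma, H):
    """Maximize E[sum_{t=0}^{H} gamma^t R(S_t, A_t, S_{t+1})] for a tabular MDP.

    T[s, a, s'] = p(s' | s, a); R[s, a, s'] = reward for that transition.
    Returns V and policy, where V[t, s] is the optimal expected discounted
    return from state s at step t and policy[t, s] is an action attaining it.
    """
    n_states = T.shape[0]
    V = np.zeros((H + 2, n_states))              # V[H + 1] = 0: no reward after step H
    policy = np.zeros((H + 1, n_states), dtype=int)
    for t in range(H, -1, -1):
        # Q[s, a] = sum over s' of p(s' | s, a) * (R[s, a, s'] + gamma * V[t + 1, s'])
        Q = (T * (R + gamma * V[t + 1][None, None, :])).sum(axis=2)
        V[t] = Q.max(axis=1)
        policy[t] = Q.argmax(axis=1)
    return V[:H + 1], policy


# Hypothetical deterministic example: action 0 stays, action 1 switches state.
T = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                                 # reward 1 for landing in state 1

V, policy = backward_induction(T, R, gamma=0.5, H=2)
print(V[0])        # [1.75 1.75]
print(policy[0])   # [1 0]: switch out of state 0, stay in state 1
```

With $γ = 0.5$ and $H = 2$, the best return from either state is $1 + 0.5 + 0.25 = 1.75$. The policy is time-dependent in general because the number of remaining steps changes what is optimal.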

🚀 Costin Chitic
