Markov decision processes formally describe an environment for reinforcement learning where the environment is fully observable.

  • i.e. The current state completely characterizes the process
  • Almost all problems can be casted as MDPs
  • Optimal control primarily deals with continuous MDPs
  • Bandits are MDPs with one state.

The future is independent of the past

Markov’s property

A state is Markov if and only if i.e. the current state captures all relevant information from the history.

  • is the conditional probability
  • i.e. The state is a sufficient statistic of the future

Definition

Markov Decision Process A Markov Decision Process is a tuple

  • is the set of all possible states
  • is the set of all possible actions
  • is the transition function which is the probability of landing at the next state given a previous state and a selected action
  • is the reward function mapping the expected reward achieved in a transition starting at
  • is the initial state distribution
  • is the discount factor
  • is the planning horizon

A MDP is considered β€œsolved” if we find a policy that maximizes the expected discounted return.

  • i.e. the sum of discounted rewards from state and acting optimally