Markov decision processes formally describe an environment for reinforcement learning where the environment is fully observable.
- i.e. The current state completely characterizes the process
- Almost all problems can be casted as MDPs
- Optimal control primarily deals with continuous MDPs
- Bandits are MDPs with one state.
The future is independent of the past
Markovβs property
A state is Markov if and only if i.e. the current state captures all relevant information from the history.
- is the conditional probability
- i.e. The state is a sufficient statistic of the future
Definition
Markov Decision Process A Markov Decision Process is a tuple
- is the set of all possible states
- is the set of all possible actions
- is the transition function which is the probability of landing at the next state given a previous state and a selected action
- is the reward function mapping the expected reward achieved in a transition starting at
- is the initial state distribution
- is the discount factor
- is the planning horizon
A MDP is considered βsolvedβ if we find a policy that maximizes the expected discounted return.
- i.e. the sum of discounted rewards from state and acting optimally