# Decentralized partially observable Markov decision process

The decentralized partially observable Markov decision process (Dec-POMDP) [1][2] is a model for coordination and decision-making among multiple agents. It is a probabilistic model that can account for uncertainty in outcomes, sensors and communication (i.e., costly, delayed, noisy or nonexistent communication). It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to the setting of multiple decentralized agents.

## Definition

### Formal definition

A Dec-POMDP is a 7-tuple ${\displaystyle (S,\{A_{i}\},T,R,\{\Omega _{i}\},O,\gamma )}$, where

• ${\displaystyle S}$ is a set of states,
• ${\displaystyle A_{i}}$ is a set of actions for agent i, where ${\displaystyle A=\times _{i}A_{i}}$ is the set of joint actions,
• ${\displaystyle T}$ is a set of conditional transition probabilities between states, ${\displaystyle T(s,a,s')=P(s'\mid s,a)}$,
• ${\displaystyle R:S\times A\to \mathbb {R} }$ is the reward function,
• ${\displaystyle \Omega _{i}}$ is a set of observations for agent i, where ${\displaystyle \Omega =\times _{i}\Omega _{i}}$ is the set of joint observations,
• ${\displaystyle O}$ is a set of conditional observation probabilities ${\displaystyle O(s',a,o)=P(o\mid s',a)}$, and
• ${\displaystyle \gamma \in [0,1]}$ is the discount factor.
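The 7-tuple above can be mirrored directly in code. The following is a minimal sketch (the class name, field names, and dictionary-based encoding of ${\displaystyle T}$ and ${\displaystyle O}$ are illustrative choices, not a standard API):

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical container mirroring the 7-tuple (S, {A_i}, T, R, {Omega_i}, O, gamma).
@dataclass
class DecPOMDP:
    states: List[str]                     # S
    actions: List[List[str]]              # A_i: one action set per agent
    T: Dict[Tuple[str, Tuple[str, ...], str], float]  # (s, a, s') -> P(s' | s, a)
    R: Callable[[str, Tuple[str, ...]], float]        # R(s, a)
    observations: List[List[str]]         # Omega_i: one observation set per agent
    O: Dict[Tuple[str, Tuple[str, ...], Tuple[str, ...]], float]  # (s', a, o) -> P(o | s', a)
    gamma: float                          # discount factor in [0, 1]

    def joint_actions(self) -> List[Tuple[str, ...]]:
        # The joint action set A is the Cartesian product of the per-agent sets.
        return list(product(*self.actions))
```

Here a joint action is a tuple with one entry per agent, matching the product structure ${\displaystyle A=\times _{i}A_{i}}$.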

At each time step, each agent takes an action ${\displaystyle a_{i}\in A_{i}}$; the state updates according to the transition function ${\displaystyle T(s,a,s')}$ (using the current state and the joint action); each agent receives an observation according to the observation function ${\displaystyle O(s',a,o)}$ (using the next state and the joint action); and a reward is generated for the whole team according to the reward function ${\displaystyle R(s,a)}$. This process repeats until some given horizon (the finite-horizon case) or forever (the infinite-horizon case), and the goal is to maximize the expected cumulative reward. In the infinite-horizon case, the discount factor ${\displaystyle \gamma \in [0,1)}$ keeps the sum finite.
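The step dynamics just described can be sketched on a hypothetical two-agent, two-state toy problem (all states, actions, probabilities and the reward rule below are invented for illustration):

```python
import random

# Toy Dec-POMDP: two agents, states "s0"/"s1", per-agent actions "a"/"b".

def transition(s, a, rng):
    # Sample s' from T(s, a, .): disagreement between the agents flips the
    # state; agreement keeps it (a deterministic transition for simplicity).
    if a[0] == a[1]:
        return s
    return "s1" if s == "s0" else "s0"

def observe(s_next, a, rng):
    # Sample a joint observation from O(s', a, .): each agent independently
    # sees the true next state with probability 0.8, the wrong one otherwise.
    obs = []
    for _ in range(2):
        correct = rng.random() < 0.8
        obs.append(s_next if correct else ("s1" if s_next == "s0" else "s0"))
    return tuple(obs)

def reward(s, a):
    # Shared team reward R(s, a): +1 when both agents play "a" in state "s0".
    return 1.0 if (s == "s0" and a == ("a", "a")) else 0.0

def discounted_return(policy, gamma=0.9, horizon=20, seed=0):
    # One episode: act, collect the team reward, update the state, observe.
    rng = random.Random(seed)
    s, total = "s0", 0.0
    obs = ("s0", "s0")  # arbitrary initial observation
    for t in range(horizon):
        a = policy(obs)                       # each agent acts on its observation
        total += (gamma ** t) * reward(s, a)  # discounted team reward
        s = transition(s, a, rng)             # state update via T
        obs = observe(s, a, rng)              # joint observation via O
    return total
```

For the (deliberately trivial) joint policy in which both agents always play `"a"`, the state never leaves `"s0"` and the finite-horizon discounted return is ${\displaystyle \sum _{t=0}^{19}\gamma ^{t}}$.

```python
always_a = lambda obs: ("a", "a")
discounted_return(always_a)  # sum of 0.9**t for t = 0..19
```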

## References

1. ^ Bernstein, Daniel S.; Givan, Robert; Immerman, Neil; Zilberstein, Shlomo (November 2002). "The Complexity of Decentralized Control of Markov Decision Processes". Math. Oper. Res. 27 (4): 819–840. arXiv:1301.3836. doi:10.1287/moor.27.4.819.297. ISSN 0364-765X.
2. ^ Oliehoek, Frans A.; Amato, Christopher (2016). A Concise Introduction to Decentralized POMDPs | SpringerLink (PDF). SpringerBriefs in Intelligent Systems. doi:10.1007/978-3-319-28929-8. ISBN 978-3-319-28927-4.