Talk:Bellman equation


General explanation[edit]

Shuroo (talk) 09:19, 29 May 2009 (UTC) This explanation is too specific. We need a general explanation of the Bellman equation.


I can't find the "principle of optimality" anywhere in Bellman's 1952 paper. Perhaps this should be his 1957 book. —Preceding unsigned comment added by (talk) 17:01, 11 December 2007 (UTC)

Bellman's definition from Dynamic Programming, 1957 (Dover 2003 edition, p. 83):

Principle of Optimality. An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

--Rinconsoleao (talk) 19:57, 25 May 2009 (UTC)

Notation for dynamic problems[edit]

The equation given in the article is wrong, or at least not true in general. The right equation is

V(x) = \max_{y \in \Gamma(x)} [F(x,y) + \beta V(T(x,y))], \quad \forall x \in X.

where T(x,y) is the transformation of the state given a state x and an action y.

Shuroo —Preceding unsigned comment added by Shuroo (talkcontribs) 20:27, 27 May 2009 (UTC)
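For concreteness, the fixed point of the deterministic equation above can be computed by value iteration. A minimal sketch, assuming a hypothetical finite problem in which the action directly picks the next state (so T(x,y) = y and Gamma(x) is the whole state set); all names and numbers here are invented for illustration:

```python
# Value iteration for the deterministic Bellman equation
#   V(x) = max_{y in Gamma(x)} [F(x, y) + beta * V(T(x, y))]
# Hypothetical setup: states 0..n-1, Gamma(x) = all states, T(x, y) = y.

def value_iteration(F, beta=0.9, tol=1e-10):
    """F[x][y] is the payoff of moving from state x to state y."""
    n = len(F)
    V = [0.0] * n
    while True:
        V_new = [max(F[x][y] + beta * V[y] for y in range(n)) for x in range(n)]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new

# toy example: moving into state 1 pays 1, anything else pays 0
F = [[0.0, 1.0], [0.0, 1.0]]
V = value_iteration(F)
# both states can move to state 1 immediately, so V[0] = V[1] = 1/(1-0.9) = 10
```

The iteration converges because the Bellman operator is a contraction with modulus beta < 1.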

That depends on how we are defining the notation. The way the problem was set up (I didn't write it myself) defines transitions by stating that the feasible future states are x_{t+1}\in\Gamma(x_t). So the 'transformation of the state' you mention is captured by that notation. Of course, I don't mind moving to clearer notation if others prefer. --Rinconsoleao (talk) 11:04, 28 May 2009 (UTC)

By the way, the notation appears to come from Stokey, Lucas, Prescott (1989), Chapter 4. --Rinconsoleao (talk) 11:07, 28 May 2009 (UTC)

The notation I have suggested is easier to understand and much more practical for dynamic programming problems with a finite horizon. (It also matches the Wikipedia article on dynamic programming better.) —Preceding unsigned comment added by Shuroo (talkcontribs) 14:28, 28 May 2009 (UTC)

Notation for stochastic problems[edit]

Yet another argument for my notation is that it can easily be extended to stochastic problems (e.g., inventory management). To obtain the stochastic form of the equation, we simply add E (expectation) before the second term, since the new state is a random variable.

V(x) = \max_{y \in \Gamma(x)} [F(x,y) + \beta E(V(T(x,y)))], \quad \forall x \in X.

where T(x,y) is the transformation of the state given a state x and an action y. Shuroo (talk) 09:21, 29 May 2009 (UTC)
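The stochastic equation above can likewise be solved by value iteration, with the expectation written as a sum over transition probabilities. A minimal sketch with a hypothetical finite problem (P[x][y][x2] is an invented transition array, not notation from the discussion):

```python
# Value iteration for the stochastic Bellman equation
#   V(x) = max_{y} [F(x, y) + beta * E[V(x')]],  with x' random given (x, y)
# Hypothetical finite problem: P[x][y][x2] = prob of moving from x to x2 under y.

def stochastic_value_iteration(F, P, beta=0.9, tol=1e-10):
    n = len(F)
    V = [0.0] * n
    while True:
        V_new = [max(F[x][y] + beta * sum(p * V[x2] for x2, p in enumerate(P[x][y]))
                     for y in range(len(F[x])))
                 for x in range(n)]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new

# toy example: payoff 1 for being in state 1; action 0 stays put,
# action 1 flips the state with probability 0.8
F = [[0.0, 0.0], [1.0, 1.0]]
P = [[[1.0, 0.0], [0.2, 0.8]],     # from state 0
     [[0.0, 1.0], [0.8, 0.2]]]     # from state 1
V = stochastic_value_iteration(F, P)
# V[1] = 1/(1-0.9) = 10; V[0] = 0.9*0.8*10/(1 - 0.9*0.2) = 7.2/0.82
```

Setting every row of P to a point mass recovers the deterministic equation, which is the sense in which the stochastic form only "adds E before the second term".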

I agree that the form you are proposing is probably easier to understand. Go ahead and change the notation if you want. (Notice that you will need to change the notation on the original sequence problem too, for consistency with the form of the Bellman equation you are proposing.) I'll keep an eye open for errors. --Rinconsoleao (talk) 10:21, 29 May 2009 (UTC)

Done Shuroo (talk) 08:51, 30 May 2009 (UTC)

I'm sorry, I thought your notation for the stochastic case would show more explicitly how the next state x_{t+1} is determined. The notation T(x,y) strikes me as misleading, because it looks like a function, which suggests that it should take a unique value. A rigorous way to define it would be to write a c.d.f. of the form G(x_{t+1}|x_t,a_t) representing the probability distribution of possible states x_{t+1} conditional on the current state x_{t} and action a_{t}. Alternatively, we could write something like x_{t+1}=T(x_t,a_t)+\epsilon_{t+1}, where \epsilon_{t+1} is a mean-zero shock, but that is more restrictive. --Rinconsoleao (talk) 13:47, 1 June 2009 (UTC)
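The additive-shock alternative mentioned at the end of the comment above can be sketched directly; T and the shock scale here are hypothetical, chosen only to show the structure:

```python
import random

# Additive-shock law of motion x_{t+1} = T(x_t, a_t) + eps_{t+1},
# with a mean-zero shock eps, the more restrictive special case above.

def T(x, a):
    return 0.5 * x + a              # hypothetical deterministic part

def next_state(x, a, sigma=0.1):
    eps = random.gauss(0.0, sigma)  # mean-zero shock, so E[x_{t+1}] = T(x, a)
    return T(x, a) + eps
```

This is more restrictive than a general conditional c.d.f. G(y|x,a) because the entire distribution of x_{t+1} is pinned down to a shifted copy of the shock distribution, rather than being allowed to depend arbitrarily on (x, a).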

What I wrote was a simple but valid formula that could explain the example in the next section. (I used T for the needed random variable, but indicated that it was a random variable, not a function.) It seems to me that adding too many details would make the article much less attractive, because introducing conditional distributions properly requires some heavy machinery.

Pity that you erased it. Shuroo (talk) 18:47, 1 June 2009 (UTC)

By the way, the current article implicitly uses the version that you erased in its example. Surely it is better to write it explicitly. Shuroo (talk) 19:32, 1 June 2009 (UTC)

Now I've added a lot more discussion, and a stochastic example with (I think) rigorous notation. Let me know if you find errors.--Rinconsoleao (talk) 18:16, 4 June 2009 (UTC)

If you want to do the stochastic part properly, it would be better to write it in terms of a Markov decision process. Shuroo (talk) 10:13, 5 June 2009 (UTC)

I wrote it this way because the G(y|x,a) notation is actually very close to your y=T(x,a) notation, except that it makes explicit the fact that we are talking about a random variable. And it allowed me to write a very similar Bellman equation for the stochastic and deterministic cases (so the stochastic subsection ends up very short and simple without being incorrect). But we could certainly include a link to MDP too.--Rinconsoleao (talk) 11:03, 6 June 2009 (UTC)

In any case, if you want the article to be rigorous, at least to some extent, you certainly need to state the Markovian property of the system. Shuroo (talk) 18:57, 6 June 2009 (UTC)


"It breaks a dynamic optimization problem into simpler subproblems, writing the value of a decision problem at a certain point in time in terms of the payoff from some initial choices and the value of the remaining decision problem that results from those initial choices, as Bellman's Principle of Optimality prescribes." This sentence is VERY long and obfuscated; it may deserve a reformulation, don't you think? —Preceding unsigned comment added by (talk) 09:35, 20 July 2010 (UTC)


As I understand it, the mathematical expression given under the Bellman's Principle of Optimality section is incorrect. The claim is made that it is the same as (the RHS of, presumably) the previous equation, but I don't think that's true. As it stands, it actually describes a greedy solution, not one exploiting optimal substructure. The reason is that the first term greedily selects the action at time 0 giving maximum payoff, when in fact a suboptimal action at time 0 may maximize the overall value by bringing the system into a different state at time 1 whose value (maximum possible discounted sum of payoffs), added to that (suboptimal) time-0 payoff, is greater. Not sure what the correct expression would be, I'm afraid. (talk) 03:18, 15 February 2011 (UTC)

No, the expression does not refer to a 'greedy' algorithm-- in other words, it is not maximizing the first-period payoff in isolation. The control a0 is chosen so that it maximizes the first-period payoff PLUS the continuation value of the rest of the problem (which need not imply that it maximizes the first-period payoff by itself, as a greedy algorithm would). I will add another set of parentheses to make this clearer. Rinconsoleao (talk) 09:06, 15 February 2011 (UTC)
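The distinction being made above can be shown with a tiny two-period example (all numbers hypothetical): a greedy rule that maximizes only the first-period payoff picks a different action than the Bellman rule, which maximizes payoff plus continuation value.

```python
# Greedy vs. Bellman-optimal choice in a hypothetical two-period problem.
# In the initial state there are two actions:
#   action "A": payoff 5 now, leads to a continuation worth 0
#   action "B": payoff 1 now, leads to a continuation worth 10
beta = 0.9
payoff = {"A": 5.0, "B": 1.0}
continuation = {"A": 0.0, "B": 10.0}

greedy = max(payoff, key=payoff.get)                               # maximizes F alone
optimal = max(payoff, key=lambda a: payoff[a] + beta * continuation[a])

# greedy picks "A" (5 > 1), but the Bellman choice is "B" (1 + 0.9*10 = 10 > 5)
```

So maximizing F(x, a) + beta*V(next state) jointly, as the article's expression does, is not greedy.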

Value function[edit]

I think the term "value function" deserves an article alone!Luizabpr (talk) 19:07, 21 May 2011 (UTC)

Broken stochastic eqn[edit]

There is something wrong with the equation defining the conditional probability, but am not quite sure of the fix. Currently it says:

G(y|x,a) = {\rm prob}(x_{t+1}\leq y|x_t=x,a_t=a)

But this can't be right, since x_t is a state, and so in general, the idea of x_{t+1}\leq y is impossible to define. Perhaps the following was intended:

P_G(y|x,a) = {\rm prob}(G(x_{t+1})\leq y|x_t=x,a_t=a)

??? I dunno, please fix. linas (talk) 17:00, 28 June 2012 (UTC)

I don't understand what you mean when you say that x_{t+1}\leq y is impossible to define. Perhaps you are assuming that the state x_{t+1} is chosen at t, and is therefore deterministic at t. But the Bellman equation, as written, shows that the decision-maker chooses an action a at time t, which does not exactly determine x_{t+1}. Instead, it alters the probability distribution of x_{t+1}. That's what the conditional distribution G(y|x,a) = {\rm prob}(x_{t+1}\leq y|x_t=x,a_t=a) shows.
So I see no problem with the equation. Removing the discussion flag. Rinconsoleao (talk) 18:56, 19 September 2012 (UTC)

I agree with linas. The presentation of the article in general is a bit hand-wavy. One should at least state the principle of optimality precisely, IMHO. That particular section is confused, both in terminology and conceptually. The stochastic shocks are usually assumed to follow a Markov process, which, loosely speaking, maps a measurable space to probability measures on that space. They are not "random variables". Neither is the state a "random variable", as the article right now says.
The Bellman equation is incorrect. The expectation should be taken with respect to the appropriate probability measure, the one that determines the next period shock. The Bellman equation depends on the timing of the model:
If control is being chosen after the current period shock is realized:
V(x, z) = \max_{a \in \Gamma(x,z)} F(x, a, z) + \beta \int V( T(x,a), z') d\mu_z(z').
If control is chosen ex ante:
V(x, z) = \max_{a \in \Gamma(x)} F(x, a) + \beta \int V(T(x, a, z), z') d\mu_z(z').
It is sloppy to say the state "next period" x_{t+1} is a "random variable". Once an optimal policy function is determined from the Bellman equation, then the state for a given period is a random variable, with the underlying probability space given by the Markov process. If one wants to condition on the current state x_{t}, then in the ex post case, the next period state x_{t+1} is not random at all. Mct mht (talk) 12:15, 16 March 2013 (UTC)
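The "ex post" timing in the comment above (control chosen after the current shock is realized) can be sketched with a discretized shock, where the integral against mu_z becomes a sum over the rows of a Markov transition matrix. All arrays here are hypothetical, invented for illustration:

```python
# Value iteration for V(x, z) = max_a [ F(x, a, z)
#                        + beta * sum_{z'} mu[z][z'] * V(T(x, a), z') ]
# Hypothetical finite objects: F[x][a][z] payoff, T[x][a] next state,
# mu[z][z2] the shock's Markov transition probabilities.

def markov_value_iteration(F, T, mu, beta=0.9, tol=1e-10):
    nx, na, nz = len(F), len(F[0]), len(mu)
    V = [[0.0] * nz for _ in range(nx)]
    while True:
        V_new = [[max(F[x][a][z] +
                      beta * sum(mu[z][z2] * V[T[x][a]][z2] for z2 in range(nz))
                      for a in range(na))
                  for z in range(nz)]
                 for x in range(nx)]
        if max(abs(V[x][z] - V_new[x][z])
               for x in range(nx) for z in range(nz)) < tol:
            return V_new
        V = V_new

# degenerate check: a single shock value reduces this to the deterministic case
F = [[[0.0], [1.0]], [[0.0], [1.0]]]   # payoff 1 for action 1, in any state
T = [[0, 1], [0, 1]]                    # action a moves the system to state a
mu = [[1.0]]                            # one shock state that persists forever
V = markov_value_iteration(F, T, mu)
# V[x][0] = 1/(1-0.9) = 10 for both states
```

Writing the state as the pair (x, z) and conditioning the shock only on its own previous value is exactly the Markovian property the comment asks the article to state.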