
Talk:State–action–reward–state–action

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by RichardKatz (talk | contribs) at 00:30, 17 November 2011 (Correct Algorithm ?). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

WikiProject Robotics (Stub-class, Low-importance)
This article is within the scope of WikiProject Robotics, a collaborative effort to improve the coverage of Robotics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Stub-class on Wikipedia's content assessment scale.
This article has been rated as Low-importance on the project's importance scale.

Date

When was this algorithm invented? XApple 19:46, 7 May 2007 (UTC)

First published in 1994; added the info. 220.253.135.178 16:50, 21 May 2007 (UTC)
Hey, thanks a lot for contributing to Wikipedia! XApple 23:05, 27 May 2007 (UTC)

Updates

For updates, SARSA uses the next action actually chosen, not the best next action, so that the update reflects the value of the last state/action pair under the current policy. If you use the best next action instead, you end up with Watkins's Q-learning, to which SARSA was intended as an alternative. Updating with the value of the best next action (Watkins's Q-learning) can over-estimate values, because the control method will not pick that action every time (it has to balance exploration and exploitation). A comparison between Q-learning and SARSA, perhaps using Cliff World from Sutton and Barto's 'Reinforcement Learning: An Introduction' (1998), may be useful to clarify the differences and the resulting behaviour. --131.217.6.6 08:17, 29 May 2007 (UTC)
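To make the distinction concrete, here is a minimal Python sketch of the two update rules applied to the same transition. The table layout, the `ACTIONS` list, the function names, and the numbers are all made up for illustration; they are not taken from the article.

```python
from collections import defaultdict

ACTIONS = ["left", "right"]  # hypothetical action set, for illustration only


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action the policy actually chose next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the best next action, whether or not it is taken."""
    target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def make_table():
    """Toy Q-table in which 'right' looks much better than 'left' in state 1."""
    Q = defaultdict(float)
    Q[(1, "left")] = 0.0
    Q[(1, "right")] = 10.0
    return Q


# Same transition for both learners: in state 0 take 'right', receive reward 1,
# land in state 1, where the (exploring) policy then picks 'left'.
q_sarsa, q_watkins = make_table(), make_table()
sarsa_update(q_sarsa, 0, "right", 1.0, 1, "left", alpha=0.5, gamma=0.9)
q_learning_update(q_watkins, 0, "right", 1.0, 1, alpha=0.5, gamma=0.9)
print(q_sarsa[(0, "right")], q_watkins[(0, "right")])
```

With these toy numbers the SARSA estimate for (0, 'right') moves to 0.5 and the Q-learning estimate to 5.0: the only difference is whether the exploratory 'left' or the greedy 'right' is used to bootstrap.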


This is the algorithm presented in Q-Learning:

SARSA:

Does it use "backpropagation"? That is, does it update the previous Q entry with the future reward? Dspattison (talk) 19:20, 19 March 2008 (UTC)

Correct Algorithm ?

Is the algorithm as given correct? Should it not be R(t) rather than R(t+1)? I've looked at [1], and that seems to support what Thrun & Norvig teach in their Stanford ai-class. wheeliebin (talk) 04:58, 12 November 2011 (UTC)

Note also that in section 1, where the page states "Taking every letter in the quintuple", it lists "R(t+1)". Shouldn't this be "R(t)" as well? -- RDK
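
On the indexing question, it may help to write the update out under both labelling conventions. This is only a sketch, assuming the article follows Sutton and Barto, who label the reward received after taking action a(t) in state s(t) as r(t+1) because it arrives together with the next state; some other presentations write r(t) for that same reward. The symbols α (learning rate) and γ (discount factor) are the usual ones and do not come from this thread.

```latex
% Convention A (Sutton and Barto): the reward is indexed with the next state.
% Quintuple: (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

% Convention B: the same reward is indexed with the action that produced it.
% Quintuple: (s_t, a_t, r_t, s_{t+1}, a_{t+1})
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
```

Under that assumption the two forms describe the same update and differ only in how the reward is labelled, so either is fine as long as the quintuple and the update formula on the page use the same convention.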