Talk:Reinforcement learning

Robotics C‑class Mid‑importance

This article is within the scope of WikiProject Robotics, a collaborative effort to improve the coverage of Robotics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-class on Wikipedia's content assessment scale.
This article has been rated as Mid-importance on the project's importance scale.
	This article has been marked as needing immediate attention.

Question

Is R=Σ_tγ^tr_t, $R=\sum \limits _{t^{\gamma }}^{t}r_{t}$ or $R=\sum \limits _{t\gamma }^{t}r_{t}$ or $R=\sum \limits _{t}^{t}\gamma r_{t}$ ?

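Answer: It is $R=\sum \limits _{t=0}^{\infty }\gamma ^{t}r_{t}$, i.e. each reward $r_{t}$ is weighted by the discount factor $\gamma$ raised to the power $t$, and the weighted rewards are summed over all time steps $t$.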
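To make the formula concrete, here is a minimal sketch that computes the discounted return for a finite list of rewards. The function name `discounted_return` and the sample rewards are illustrative assumptions, not something taken from the article; a finite reward list simply truncates the infinite sum.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], with t starting at 0."""
    total = 0.0
    for t, r in enumerate(rewards):
        # Each reward is weighted by gamma raised to its time index.
        total += (gamma ** t) * r
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

With gamma below 1, later rewards count for less, which is what keeps the infinite sum finite when rewards are bounded.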

Policies

What exactly is a policy? The Sutton-Barto book is very vague on this point, and so is this article. In both cases the word is used without much explanation.

According to both the book and the article, a policy is a mapping from states to action probabilities. Fine. But this is not elaborated upon. What does a policy look like? I infer that it must be a table (2-D array), indexed by state and action, and containing probabilities, say p_ij for the i-th state and j-th action, each p_ij being a transition probability for the MDP. If so, what is its relation to the values derived from rewards? I.e. where exactly do the probabilities p_ij come from? How does one generate a policy table starting from values?

Sorry if I appear stupid, but I've been studying the book and I find it very difficult to comprehend, even though the maths is very simple (almost too simple). Or maybe it's in there somewhere but I've missed it?

--84.9.83.127 09:36, 18 November 2006 (UTC)

A policy is indeed a mapping from states to action probabilities, usually written π. So we could write π:S×A→[0,1], saying that π gives a probability of taking a given action a in state s. It doesn't have to be a table, it is just a function. If S and A are discrete then it can be easily written as a table, but if either is continuous then another form is needed. For instance, if S is the interval [0,10], we can set a number of radial basis functions over that interval (say, 11 of them, one at 0, one at 1, one at 2, etc.). Number them r₀, ... r₁₀. Now our policy is a function π:r₀×...×r₁₀×A→[0,1], which we can no longer write as a table.
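As a rough illustration of the two cases above, the sketch below writes a discrete policy as a table and a policy over the interval [0,10] as a function of 11 radial basis features. The state and action names, the basis width, the weights, and the use of a softmax over linear preferences are all assumptions made for the example; they are one possible choice, not the only way to build such a policy.

```python
import math

# Discrete case: the policy is a table, one row per state,
# and each row is a probability distribution over actions.
policy_table = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}


def tabular_policy(state, action):
    """pi(state, action) looked up directly in the table."""
    return policy_table[state][action]


# Continuous case: S = [0, 10] with 11 radial basis functions at 0, 1, ..., 10.
CENTERS = list(range(11))
WIDTH = 1.0  # assumed width of each basis function
ACTIONS = ["left", "right"]
# Hypothetical weights: one weight per (action, basis function).
WEIGHTS = {
    "left": [1.0 - 0.2 * c for c in CENTERS],
    "right": [-1.0 + 0.2 * c for c in CENTERS],
}


def rbf_features(s):
    """Activation of each basis function r_0 .. r_10 at state s."""
    return [math.exp(-((s - c) ** 2) / (2 * WIDTH ** 2)) for c in CENTERS]


def rbf_policy(s, action):
    """pi(s, action): softmax over linear preferences in the RBF features."""
    feats = rbf_features(s)
    prefs = {a: sum(w * f for w, f in zip(WEIGHTS[a], feats)) for a in ACTIONS}
    m = max(prefs.values())  # subtract the max for numerical stability
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    return exps[action] / sum(exps.values())
```

The tabular version needs one entry per state-action pair, while the RBF version is defined for every real number in [0,10] and so cannot be listed exhaustively.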

The relation of the policy to values depends on the particular solution being used for the RL problem. In an actor-critic architecture, the policy is the set of state-action values along with a function for selecting an action (softmax, for instance, or just choosing the action with the highest value) and the state-action values are updated according to state values and the error signal. In a Q-learning agent, the policy and the values are essentially the same. Well, more correctly the policy is a function of the values given by the action selection mechanism.
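The idea that the policy is a function of the values, given by the action selection mechanism, can be sketched as follows. The two selection rules shown (greedy and softmax) are the ones mentioned above; the function names, the `temperature` parameter, and the example values are assumptions for illustration.

```python
import math


def greedy_policy(q_values):
    """Put probability 1 on the highest-valued action (first one on ties)."""
    best = max(q_values, key=q_values.get)
    return {a: (1.0 if a == best else 0.0) for a in q_values}


def softmax_policy(q_values, temperature=1.0):
    """Turn action values into probabilities; higher values get more weight."""
    m = max(q_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


# Hypothetical state-action values for a single state.
q = {"left": 1.0, "right": 2.0}
print(greedy_policy(q))   # {'left': 0.0, 'right': 1.0}
print(softmax_policy(q))  # 'right' gets the larger probability
```

Both functions answer the original question of where the probabilities come from: they are computed from the current value estimates, so the policy changes whenever the values are updated.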

For the most part, when you're just learning reinforcement learning theory, the use of policies may not be particularly clear. At least, in my own case, I didn't understand the focus on policies until I read Sutton, Precup, and Singh (1999) on options [1], at which point policies became crystal clear.

Hope that answers your question. digfarenough (talk) 19:25, 4 March 2007 (UTC)

Thanks. But your reply raises more questions for me, which I need to try and find answers to! --84.9.75.142 22:41, 16 March 2007 (UTC) (formerly 84.9.83.127)

Feel free to ask further questions on my talk page. I'm certainly no expert on reinforcement learning, but I've written one paper on it and have written a large number of simulations of RL-related things, so I at least know the basics. digfarenough (talk) 01:09, 17 March 2007 (UTC)

I hope the new version explains what a policy might mean. In fact, it has multiple meanings and is used somewhat inconsistently in the literature. Szepi (talk) 03:11, 7 September 2010 (UTC)

merge with Q learning

There is a short article on Q learning that could be merged with reinforcement learning. Kpmiyapuram 14:23, 24 April 2007 (UTC)

I'd offer that Q Learning be expanded instead. In Q Learning's "See Also" there's Watkins' thesis, which I faintly remember is where Q Learning was introduced; but there's no mention of Watkins or any other researcher in the article. Additionally, Sutton's RL book is listed, which would be a great source to mine for further detail on history and application. --59.167.203.115 (talk) 01:17, 11 January 2008 (UTC)

I'd back Q-learning being expanded instead, with a summary in RL. As Q-learning is an active area of research it will grow over time, so it would be short-sighted to merge them - especially as they are already separate. At the start of my research it would have been SO helpful to know what was applicable to RL generally, and what was Q-Learning. --217.37.215.53 (talk) 10:05, 6 March 2008 (UTC)

algorithms/concepts not mentioned

active (policy improvement) vs passive (policy evaluation)
Adaptive Dynamic Programming (ADP) —Preceding unsigned comment added by 132.177.27.1 (talk) 17:23, 1 April 2008 (UTC)

Policy improvement and evaluation are included now. However, these methods are rarely if ever called active/passive. The problems addressed by these methods are control learning and prediction learning. These could be included.
ADP refers to approximate dynamic programming, as far as I know. I have added the term to the article. Thanks for the suggestions.

Szepi (talk) 03:20, 7 September 2010 (UTC)
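For readers following this thread, here is a minimal sketch of the distinction between policy evaluation (prediction) and policy improvement (control). The two-state model in `step`, the discount factor, and all names are hypothetical and chosen only to keep the example small; the evaluation loop assumes a known, deterministic model.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "move"]


def step(s, a):
    """Hypothetical deterministic model: returns (next_state, reward)."""
    s_next = s if a == "stay" else 1 - s
    reward = 1.0 if s_next == 1 else 0.0  # reward for landing in state 1
    return s_next, reward


def evaluate(policy, sweeps=500):
    """Prediction: iterate V(s) = r + GAMMA * V(s') under a fixed policy."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            s_next, r = step(s, policy[s])
            v[s] = r + GAMMA * v[s_next]
    return v


def improve(v):
    """Control: pick, in each state, the action that is greedy w.r.t. V."""
    def backup(s, a):
        s_next, r = step(s, a)
        return r + GAMMA * v[s_next]
    return {s: max(ACTIONS, key=lambda a: backup(s, a)) for s in STATES}


policy = {0: "stay", 1: "stay"}
values = evaluate(policy)   # V(0) = 0, V(1) close to 10
better = improve(values)    # {0: 'move', 1: 'stay'}
print(values, better)
```

Evaluation only measures how good a given policy is, while improvement uses those values to change the policy; alternating the two is the basic pattern behind policy iteration.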

Economics?

Where's all the stuff about learning in games? It would be great if someone could incorporate this. Jeremy Tobacman 23:40, 1 August 2007 (UTC)

It's certainly relevant, but you may have to add it yourself if you're familiar with the subject. I've come across that aspect a few times but never really looked into it, though I have seen quite a few papers on interacting multiagent systems from the game and economic perspectives (always, I think, the agents were working against each other to try to maximize profit or win the game, etc.). So add what you know, and others may be able to clean up any incorrect claims. digfarenough (talk) 16:31, 2 August 2007 (UTC)

Psychology

This article starts with a reference to 'Reinforcement learning' in psychology. Isn't there an article about that? --Rinconsoleao 13:43, 27 September 2007 (UTC)

Found it... --Rinconsoleao 13:45, 27 September 2007 (UTC)

Literature

I feel the literature referenced by Csaba Szepesvári was a useful addition and perhaps should not have been removed. Even though he referenced a book written by himself, he is a well-known and respected researcher in reinforcement learning, and this book is a useful overview of the field. I do not know of many good recent alternatives, so I would favor reverting MrOllie's revision. However, rather than immediately doing so, I thought it might be better to start a discussion.

What literature would be indispensable? (In my opinion, in any case the books by Sutton & Barto and by Bertsekas & Tsitsiklis, although most of the other referenced work at present also looks fine.)
What literature might be removed? (For instance, I haven't read the latest addition by Tokic, is this a relevant enough paper to include?)
Is there any important work missing? (As mentioned, I would favor the return of a reference to Csaba Szepesvári's book.) —Preceding unsigned comment added by 192.16.201.233 (talk) 12:04, 20 September 2010 (UTC)

Attention needed

Check refs
Check content for missing statements
Assess on B scale

Chaosdruid (talk) 05:03, 6 March 2011 (UTC)

small and large mdps

'The theory of small mdps is [..] mature; [..] the theory of large mdps needs more work.'

What does that even mean? Theory is theory: if you understand an mdp with 10 states, then you understand one with ten million states. Standard algorithms may run too slowly, but I can't see the conceptual difference between ten and ten million as far as theory is concerned.

Does the author mean either: a) small equals finite and large equals countably or uncountably infinite, or b) approximation methods (themselves only useful when direct methods fail) are not as well understood?

— Preceding unsigned comment added by 157.193.140.25 (talk) 09:21, 26 August 2011 (UTC)