Q-learning
|
|
This article may be too technical for most readers to understand. Please help improve this article to make it understandable to non-experts, without removing the technical details. The talk page may contain suggestions. (September 2010) |
Q-learning is a reinforcement learning technique that works by learning an action-value function that gives the expected utility of taking a given action in a given state and following a fixed policy thereafter. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. A recent variation called delayed Q-learning has shown substantial improvements, bringing Probably approximately correct learning (PAC) bounds to Markov decision processes.[1]
Contents |
[edit] Algorithm
The problem model consists of an agent, states S and a set of actions per state A. By performing an action
, the agent can move from state to state. Each state provides the agent a reward (a real or natural number). The goal of the agent is to maximize its total reward. It does this by learning which action is optimal for each state.
The algorithm therefore has a function which calculates the Quality of a state-action combination:
Before learning has started, Q returns a fixed value, chosen by the designer. Then, each time the agent is given a reward (the state has changed) new values are calculated for each combination of a state s from S, and action a from A. The core of the algorithm is a simple value iteration update. It assumes the old value and makes a correction based on the new information.
where
is the reward observed after performing
in
,
(
) the learning rate (may be the same for all pairs). The discount factor
is such that 
The above formula is equivalent to:
![Q(s_t,a_t) \leftarrow Q(s_t,a_t)(1-\alpha_t(s_t,a_t)) + \alpha_t(s_t,a_t) [R_{t+1} + \gamma \max_{a_{t+1}}Q(s_{t+1}, a_{t+1})]](http://upload.wikimedia.org/wikipedia/en/math/3/5/4/35449975429aacd22cccd3be59ee4e1c.png)
An episode of the algorithm ends when state
is a final state (or, "absorbing state").
Note that for all final states
,
is never updated and thus retains its initial value.
[edit] Influence of variables on the algorithm
[edit] Learning rate
The learning rate determines to what extent the newly acquired information will override the old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information.
[edit] Discount factor
The discount factor determines the importance of future rewards. A factor of 0 will make the agent "opportunistic" by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the
values may diverge.
[edit] Implementation
Q-learning at its simplest uses tables to store data. This very quickly loses viability with increasing levels of complexity of the system it is monitoring/controlling. One answer to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Tesauro in his Backgammon playing temporal difference learning research. An adaptation of the standard neural network is required because the required result (from which the error signal is generated) is itself generated at run-time.
[edit] Early study
Q-learning was first introduced by Watkins[2] in 1989.
The convergence proof was presented later by Watkins and Dayan[3] in 1992.
[edit] See also
- Reinforcement learning
- Temporal difference learning
- SARSA
- Iterated prisoner's dilemma
- Game theory
- Fitted Q iteration algorithm
[edit] External links
- Q-Learning topic on Knol
- Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.
- Strehl, Li, Wiewiora, Langford, Littman (2006). PAC model-free reinforcement learning
- Q-Learning by Examples
- Reinforcement Learning: An Introduction by Richard Sutton and Andrew S. Barto, an online textbook. See "6.5 Q-Learning: Off-Policy TD Control".
- Connectionist Q-learning Java Framework
- Piqle: a Generic Java Platform for Reinforcement Learning
- Reinforcement Learning Maze, a demonstration of guiding an ant through a maze using Q-learning.
- Q-learning work by Gerald Tesauro
- Q-learning work by Tesauro Citeseer Link
- Q-learning algorithm implemented in processing.org language
[edit] References
- ^ Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. Pac model-free reinforcement learning. In Proc. 23nd ICML 2006, pages 881–888, 2006.
- ^ Watkins, C.J.C.H., (1989), Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
- ^ Watkins and Dayan, C.J.C.H., (1992), 'Q-learning.Machine Learning', ISBN : 8:279-292
| This artificial intelligence-related article is a stub. You can help Wikipedia by expanding it. |

![Q(s_t,a_t) \leftarrow \underbrace{Q(s_t,a_t)}_{\rm old~value} + \underbrace{\alpha_t(s_t,a_t)}_{\rm learning~rate} \times \left[ \overbrace{\underbrace{R_{t+1}}_{\rm reward} + \underbrace{\gamma}_{\rm discount~factor} \underbrace{\max_{a_{t+1}}Q(s_{t+1}, a_{t+1})}_{\rm max~future~value}}^{\rm learned~value} - \underbrace{Q(s_t,a_t)}_{\rm old~value}\right]](http://upload.wikimedia.org/wikipedia/en/math/c/6/c/c6c7c54e9f20e3fced93d8d08584c289.png)