End-to-end reinforcement learning

In end-to-end reinforcement learning, the entire process from sensors to motors in a robot or agent is handled by a single layered or recurrent neural network without modularization, and the network is trained by reinforcement learning (RL).[1] The approach was proposed long ago,[2][3] but gained renewed attention after Google DeepMind's successful results in learning to play Atari video games (2013–15)[4][5][6][7] and in AlphaGo (2016).[8]

RL traditionally required the state space and the action space to be designed explicitly, while only the mapping from state space to action space was learned.[9] RL was therefore limited to learning actions: before learning, human designers had to specify how the state space is constructed from sensor signals and how motion commands are generated for each action. Neural networks have often been used in RL to provide non-linear function approximation and so avoid the curse of dimensionality.[9] Recurrent neural networks have also been employed, mainly to cope with perceptual aliasing in partially observable Markov decision processes (POMDPs).[10][11][12][13][14]
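As an illustration of the traditional setting, the following is a minimal sketch of tabular Q-learning, one standard RL algorithm covered in [9]. The two-action space, the integer states, and the single transition are hypothetical and chosen only to show that the designer fixes the spaces while the state-to-action mapping is what gets learned.

```python
from collections import defaultdict

# Hand-designed (hypothetical) action space: in traditional RL the designer
# fixes this before learning, and only the state-to-action mapping is learned.
ACTIONS = ["left", "right"]


def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


def greedy_action(q, state):
    """The learned mapping from state to action: pick the highest-valued action."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q = defaultdict(float)  # every action value starts at 0.0
# One hypothetical transition: moving right from state 0 reaches state 1
# and yields a reward of 1.
q_learning_update(q, state=0, action="right", reward=1.0, next_state=1)
print(greedy_action(q, 0))
```

The step size `alpha` and discount `gamma` are illustrative defaults, not values taken from any of the cited works.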

End-to-end RL extends RL from learning only actions to learning the entire process from sensors to motors, including higher-level functions that are difficult to develop independently of other functions. Higher-level functions connect directly to neither sensors nor motors, so even specifying their inputs and outputs is difficult.


The approach originated in TD-Gammon (1992).[15] In backgammon, the evaluation of the game situation was learned during self-play through TD(λ) using a layered neural network. Four inputs encoded the number of pieces of a given color at a given location on the board, and the network received 198 input signals in total. With zero knowledge built in, the network learned to play the game at an intermediate level.
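The input count and the learning rule can be sketched as follows. The breakdown of the 198 inputs and the four-unit encoding follow the description of TD-Gammon in Sutton and Barto's textbook,[9] not this article. The update function assumes a linear value estimate for brevity, whereas TD-Gammon itself used a non-linear layered network; all function names and default parameters are illustrative.

```python
def encode_point(n):
    """Encode n pieces of one color on one board point as four units.

    The first three units switch on for n >= 1, 2, 3; the fourth grows
    linearly with the number of pieces beyond three.
    """
    return [
        1.0 if n >= 1 else 0.0,
        1.0 if n >= 2 else 0.0,
        1.0 if n >= 3 else 0.0,
        (n - 3) / 2.0 if n > 3 else 0.0,
    ]


# 24 points x 2 colors x 4 units, plus 2 units for pieces on the bar,
# 2 for pieces already removed, and 2 indicating whose turn it is.
NUM_INPUTS = 24 * 2 * 4 + 2 + 2 + 2


def td_lambda_step(w, e, x, x_next, reward, alpha=0.1, gamma=1.0, lam=0.7):
    """One TD(lambda) update for a LINEAR value estimate v(x) = w . x.

    Simplifying assumption: with a linear estimate the gradient of v with
    respect to w is just x, so the eligibility trace accumulates x directly.
    Returns the new weights and the new trace.
    """
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v  # temporal-difference error
    e = [gamma * lam * ei + xi for ei, xi in zip(e, x)]
    w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
    return w, e


print(NUM_INPUTS)
```

With a layered network, the trace would accumulate the gradient of the network output with respect to each weight instead of the raw input vector.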

Shibata began working with this framework in 1997.[16][3] Shibata's group employed Q-learning and actor-critic for continuous motion tasks,[17] and used a recurrent neural network for tasks that require memory.[18] The group applied the framework to several real robot tasks[17][19] and demonstrated the learning of various functions.

Beginning around 2013, Google DeepMind showed impressive learning results in video games[4][5] and the game of Go (AlphaGo).[8] For the video games, they used a deep convolutional neural network, a type of network that had shown superior results in image recognition. Four frames of almost raw RGB pixels (84×84) were used as inputs. The network was trained by RL, with the reward representing the sign of the change in the game score. All 49 games were learned using the same network architecture and Q-learning with minimal prior knowledge. The system outperformed competing methods on almost all the games and performed at a level comparable or superior to that of a professional human game tester.[5] It is sometimes called the deep Q-network (DQN). In AlphaGo, deep neural networks were trained not only by reinforcement learning but also by supervised learning, and were combined with Monte Carlo tree search.[8]
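The input and reward handling described for the video games can be sketched as below. This is a simplified illustration and not DeepMind's implementation: the class and function names are hypothetical, frames are treated as opaque objects, and repeating the first frame to fill the stack and the discount value are assumptions made for the example.

```python
from collections import deque


def clip_reward(score_change):
    """Return the sign of the change in game score: -1, 0, or +1."""
    return (score_change > 0) - (score_change < 0)


class FrameStack:
    """Keep the most recent `size` frames as the network's input."""

    def __init__(self, size=4):
        self.frames = deque(maxlen=size)

    def push(self, frame):
        # Assumption: at the start of an episode the first frame is repeated
        # so that the stack is already full on the first step.
        if not self.frames:
            self.frames.extend([frame] * self.frames.maxlen)
        else:
            self.frames.append(frame)  # the oldest frame is dropped
        return list(self.frames)


def q_target(reward, next_q_values, done, gamma=0.99):
    """Q-learning target: r + gamma * max_a' Q(s', a'), or r at episode end."""
    return reward if done else reward + gamma * max(next_q_values)


stack = FrameStack()
blank = [[0] * 84 for _ in range(84)]  # placeholder 84x84 frame
state = stack.push(blank)
print(len(state), clip_reward(250))
```

Using only the sign of the score change keeps reward magnitudes comparable across games, which is one reason a single architecture and set of learning settings can be applied to all of them.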

Function emergence

Shibata's group showed that various functions emerge in this framework, including:[3]

  • Image recognition
  • Color constancy (optical illusion)
  • Sensor motion (active recognition)
  • Hand-eye coordination and hand reaching movement
  • Explanation of brain activities
  • Knowledge transfer
  • Memory
  • Selective attention
  • Prediction
  • Exploration

Communication was also shown to emerge in this framework. Modes include:[20]

  • Dynamic communication (negotiation)
  • Binarization of signals
  • Grounded communication using a real robot and camera


  1. ^ Hassabis, Demis (March 11, 2016). Artificial Intelligence and the Future (Speech).
  2. ^ Shibata, Katsunari (January 14, 2011). "Chapter 6: Emergence of Intelligence through Reinforcement Learning with a Neural Network". In Mellouk, Abdelhamid (ed.). Advances in Reinforcement Learning. Intech. pp. 99–120. ISBN 978-953-307-369-9.
  3. ^ a b c Shibata, Katsunari (March 7, 2017). "Functions that Emerge through End-to-End Reinforcement Learning". arXiv:1703.02239.
  4. ^ a b Mnih, Volodymyr; et al. (December 2013). Playing Atari with Deep Reinforcement Learning (PDF). NIPS Deep Learning Workshop 2013.
  5. ^ a b c Mnih, Volodymyr; et al. (2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529–533. Bibcode:2015Natur.518..529M. doi:10.1038/nature14236.
  6. ^ Mnih, Volodymyr; et al. (26 February 2015). Performance of DQN in the Game Space Invaders.
  7. ^ Mnih, Volodymyr; et al. (26 February 2015). Demonstration of Learning Progress in the Game Breakout.
  8. ^ a b c Silver, David; Huang, Aja; Maddison, Chris J.; Guez, Arthur; Sifre, Laurent; Driessche, George van den; Schrittwieser, Julian; Antonoglou, Ioannis; Panneershelvam, Veda; Lanctot, Marc; Dieleman, Sander; Grewe, Dominik; Nham, John; Kalchbrenner, Nal; Sutskever, Ilya; Lillicrap, Timothy; Leach, Madeleine; Kavukcuoglu, Koray; Graepel, Thore; Hassabis, Demis (28 January 2016). "Mastering the game of Go with deep neural networks and tree search". Nature. 529 (7587): 484–489. Bibcode:2016Natur.529..484S. doi:10.1038/nature16961. ISSN 0028-0836. PMID 26819042. Retrieved 10 December 2017.
  9. ^ a b Sutton, Richard S.; Barto, Andrew G. (1998). Reinforcement Learning: An Introduction. MIT Press. ISBN 978-0262193986.
  10. ^ Lin, Long-Ji; Mitchell, Tom M. (1993). Reinforcement Learning with Hidden States. From Animals to Animats. 2. pp. 271–280.
  11. ^ Onat, Ahmet; Kita, Hajime; et al. (1998). Q-learning with Recurrent Neural Networks as a Controller for the Inverted Pendulum Problem. The 5th International Conference on Neural Information Processing (ICONIP). pp. 837–840.
  12. ^ Onat, Ahmet; Kita, Hajime; et al. (1998). Recurrent Neural Networks for Reinforcement Learning: Architecture, Learning Algorithms and Internal Representation. International Joint Conference on Neural Networks (IJCNN). pp. 2010–2015.
  13. ^ Bakker, Bram; Linaker, Fredrik; et al. (2002). Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction (PDF). 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 938–943.
  14. ^ Bakker, Bram; Zhumatiy, Viktor; et al. (2003). A Robot that Reinforcement-Learns to Identify and Memorize Important Previous Observation (PDF). 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 430–435.
  15. ^ Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM. 38 (3): 58–68. doi:10.1145/203330.203343.
  16. ^ Shibata, Katsunari; Okabe, Yoichi (1997). Reinforcement Learning When Visual Sensory Signals are Directly Given as Inputs (PDF). International Conference on Neural Networks (ICNN) 1997.
  17. ^ a b Shibata, Katsunari; Iida, Masaru (2003). Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning (PDF). SICE Annual Conference 2003.
  18. ^ Utsunomiya, Hiroki; Shibata, Katsunari (2008). Contextual Behavior and Internal Representations Acquired by Reinforcement Learning with a Recurrent Neural Network in a Continuous State and Action Space Task (PDF). International Conference on Neural Information Processing (ICONIP) '08.
  19. ^ Shibata, Katsunari; Kawano, Tomohiko (2008). Learning of Action Generation from Raw Camera Images in a Real-World-like Environment by Simple Coupling of Reinforcement Learning and a Neural Network (PDF). International Conference on Neural Information Processing (ICONIP) '08.
  20. ^ Shibata, Katsunari (March 9, 2017). "Communications that Emerge through Reinforcement Learning Using a (Recurrent) Neural Network". arXiv:1703.03543.