End-to-end reinforcement learning

In '''end-to-end reinforcement learning''', the entire process from sensors to motors in a robot or agent (called the end-to-end process) is handled by a single layered or [[recurrent neural network]] without modularization, and the network is trained by [[reinforcement learning]] (RL).<ref name="Hassabis">{{cite speech |last1=Hassabis |first1=Demis | date=March 11, 2016 |title= Artificial Intelligence and the Future. |url= https://www.youtube.com/watch?v=8Z2eLTSCuBk}}</ref> The approach had long been proposed,<ref name="Shibata2">{{cite arXiv |last=Shibata |first=Katsunari |title=Functions that Emerge through End-to-End Reinforcement Learning | date=March 7, 2017 |eprint=1703.02239 |class=cs.AI }}</ref> but was reenergized by the successful results of [[Google DeepMind]] in learning to play [[Atari]] video games (2013–15)<ref name="DQN1">{{cite conference |first= Volodymyr|display-authors=etal|last= Mnih |date=December 2013 |title= Playing Atari with Deep Reinforcement Learning |url= https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf |conference= NIPS Deep Learning Workshop 2013}}</ref><ref name="DQN2">{{cite journal |first= Volodymyr|display-authors=etal|last= Mnih |year=2015 |title= Human-level control through deep reinforcement learning |journal=Nature|volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236|pmid=25719670|bibcode=2015Natur.518..529M }}</ref><ref name="Invaders">{{cite video |people= V. Mnih|display-authors=etal| date=26 February 2015 |title= Performance of DQN in the Game Space Invaders |url= http://www.nature.com/nature/journal/v518/n7540/extref/nature14236-sv1.mov}}</ref><ref name="Breakout">{{cite video |people= V. Mnih|display-authors=etal| date=26 February 2015 |title= Demonstration of Learning Progress in the Game Breakout |url= http://www.nature.com/nature/journal/v518/n7540/extref/nature14236-sv2.mov}}</ref> and [[AlphaGo]] (2016).<ref name="AlphaGo">{{Cite journal|title = Mastering the game of Go with deep neural networks and tree search|journal = [[Nature (journal)|Nature]]| issn= 0028-0836|pages = 484–489|volume = 529|issue = 7587|doi = 10.1038/nature16961|pmid = 26819042|first1 = David|last1 = Silver|author-link1=David Silver (programmer)|first2 = Aja|last2 = Huang|author-link2=Aja Huang|first3 = Chris J.|last3 = Maddison|first4 = Arthur|last4 = Guez|first5 = Laurent|last5 = Sifre|first6 = George van den|last6 = Driessche|first7 = Julian|last7 = Schrittwieser|first8 = Ioannis|last8 = Antonoglou|first9 = Veda|last9 = Panneershelvam|first10= Marc|last10= Lanctot|first11= Sander|last11= Dieleman|first12=Dominik|last12= Grewe|first13= John|last13= Nham|first14= Nal|last14= Kalchbrenner|first15= Ilya|last15= Sutskever|author-link15=Ilya Sutskever|first16= Timothy|last16= Lillicrap|first17= Madeleine|last17= Leach|first18= Koray|last18= Kavukcuoglu|first19= Thore|last19= Graepel|first20= Demis |last20=Hassabis|author-link20=Demis Hassabis|date= 28 January 2016|bibcode = 2016Natur.529..484S}}{{closed access}}</ref>

Traditional RL requires an explicit design of both the state space and the action space, while only the mapping from states to actions is learned.<ref name="RL">{{cite book | last1 = Sutton | first1 = Richard S. |last2=Barto |first2=Andrew G. | title = Reinforcement Learning: An Introduction | publisher = MIT Press | year = 1998 | isbn = 978-0262193986}}</ref> RL has therefore been limited to learning the actions alone: before learning begins, human designers must specify how the state space is constructed from the sensor signals and how the motor commands are generated for each action. Neural networks have often been used in RL to provide nonlinear function approximation and so avoid the [[curse of dimensionality]].<ref name="RL" /> [[Recurrent neural networks]] have also been employed, mainly to cope with perceptual aliasing in [[partially observable Markov decision process]]es (POMDPs).<ref name="Lin">{{cite conference |first1= Long-Ji |last1= Lin |first2= Tom M. |last2= Mitchell |year=1993 |title= Reinforcement Learning with Hidden States | journal=From Animals to Animats |volume=2 | pages=271–280 }}</ref><ref name="Onat1">{{cite conference |first1= Ahmet |last1= Onat |first2= Hajime|display-authors=etal|last2= Kita |year=1998 |title= Q-learning with Recurrent Neural Networks as a Controller for the Inverted Pendulum Problem | conference=The 5th International Conference on Neural Information Processing (ICONIP) |pages=837–840}}</ref><ref name="Onat2">{{cite conference |first1= Ahmet |last1= Onat |first2= Hajime|display-authors=etal|last2= Kita |year=1998 |title= Recurrent Neural Networks for Reinforcement Learning: Architecture, Learning Algorithms and Internal Representation |conference=International Joint Conference on Neural Networks (IJCNN) |pages=2010–2015|doi= 10.1109/IJCNN.1998.687168 }}</ref><ref name="Bakker1">{{cite conference |first1= Bram |last1= Bakker |first2= Fredrik|display-authors=etal|last2= Linaker |year=2002 |title= Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction |url=ftp://ftp.idsia.ch/pub/juergen/bakkeriros2002.pdf| conference= 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) |pages=938–943}}</ref><ref name="Bakker2">{{cite conference |first1= Bram |last1= Bakker |first2= Viktor|display-authors=etal|last2= Zhumatiy |year=2003 |title= A Robot that Reinforcement-Learns to Identify and Memorize Important Previous Observation |url=ftp://ftp.idsia.ch/pub/juergen/bakkeriros2003.pdf| conference= 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) |pages=430–435}}</ref>
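
The traditional setting can be illustrated with a minimal tabular [[Q-learning]] sketch (purely illustrative; the state and action sizes and the constants below are made up rather than taken from any cited work). The designer fixes the state and action spaces in advance, and learning only fills in the state-to-action mapping:

<syntaxhighlight lang="python">
# Illustrative sketch of traditional RL: the human designer fixes a discrete
# state space and action space in advance; learning only fills in the table.
import numpy as np

N_STATES, N_ACTIONS = 100, 4            # hand-designed by the human designer
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # arbitrary illustrative constants
Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)

def choose_action(state):
    """Epsilon-greedy choice over the pre-defined action set."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state):
    """Standard one-step Q-learning update on the table."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
</syntaxhighlight>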

End-to-end RL extends RL from learning only the actions to learning the entire process from sensors to motors, including higher-level functions that are difficult to develop independently of the other functions. Such higher-level functions connect directly to neither sensors nor motors, so even specifying their inputs and outputs is difficult.
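
A minimal sketch of the end-to-end idea (again purely illustrative, and not the implementation of any of the cited systems) replaces the hand-designed state space with a single network that maps raw sensor signals directly to motor commands and is adjusted only from a scalar reward, here with a simple REINFORCE-style perturbation update:

<syntaxhighlight lang="python">
# One network maps raw sensor signals directly to motor commands and is
# trained only from a scalar reward, with no hand-designed state space.
import numpy as np

rng = np.random.default_rng(0)
N_SENSORS, N_HIDDEN, N_MOTORS = 64, 32, 2   # illustrative sizes
W1 = rng.normal(0, 0.1, (N_HIDDEN, N_SENSORS))
W2 = rng.normal(0, 0.1, (N_MOTORS, N_HIDDEN))

def policy(sensors):
    """Single layered network: raw sensor vector -> motor commands."""
    h = np.tanh(W1 @ sensors)          # hidden layer
    return np.tanh(W2 @ h), h          # motor commands in [-1, 1]

def reinforce_update(sensors, noise, reward, lr=0.01):
    """Reinforce the exploration noise that was applied, scaled by the reward
    (a crude REINFORCE-style estimator, for illustration only)."""
    global W1, W2
    motors, h = policy(sensors)
    d_out = reward * noise * (1 - motors ** 2)           # output-layer error
    W1 += lr * np.outer((W2.T @ d_out) * (1 - h ** 2), sensors)
    W2 += lr * np.outer(d_out, h)

# Usage with a made-up environment step:
sensors = rng.normal(size=N_SENSORS)        # e.g. a flattened camera image
motors, _ = policy(sensors)
noise = rng.normal(0, 0.1, N_MOTORS)        # exploration noise added to motors
reward = 1.0                                # scalar reward returned by the task
reinforce_update(sensors, noise, reward)
</syntaxhighlight>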

== History ==
The approach originated in [[TD-Gammon]] (1992).<ref name="TD-Gammon">{{cite journal | url=http://www.bkgm.com/articles/tesauro/tdl.html | title=Temporal Difference Learning and TD-Gammon | date=March 1995 | last=Tesauro | first=Gerald | journal=Communications of the ACM | volume=38 | issue=3 | doi=10.1145/203330.203343 | pages=58–68 | access-date=2017-03-10 | archive-url=https://web.archive.org/web/20100209103427/http://www.bkgm.com/articles/tesauro/tdl.html | archive-date=2010-02-09 | url-status=dead }}</ref> In [[backgammon]], the evaluation of the game situation during self-play was learned through TD(<math>\lambda</math>) using a layered neural network. Four input units encoded the number of pieces of a given color at a given board location, giving 198 input signals in total. With zero knowledge built in, the network learned to play the game at an intermediate level.
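
The core update can be sketched as follows (simplified to a linear value function for illustration; TD-Gammon itself used a multilayer network over its 198 board-encoding inputs, and the constants below are arbitrary):

<syntaxhighlight lang="python">
# Illustrative TD(lambda) update with an eligibility trace, in the spirit of
# TD-Gammon's self-play training of an evaluation function.
import numpy as np

N_FEATURES = 198                 # size of the board encoding described above
w = np.zeros(N_FEATURES)         # value-function weights
trace = np.zeros(N_FEATURES)     # eligibility trace
alpha, lam, gamma = 0.1, 0.7, 1.0  # arbitrary illustrative constants

def value(x):
    """Predicted evaluation of a board encoded as feature vector x."""
    return w @ x

def td_lambda_step(x, x_next, reward):
    """One TD(lambda) update after observing a transition during self-play."""
    global w, trace
    delta = reward + gamma * value(x_next) - value(x)   # TD error
    trace = gamma * lam * trace + x                     # accumulate trace
    w += alpha * delta * trace                          # weight update
</syntaxhighlight>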

Shibata's group began working with this framework in 1997.<ref name="Shibata3">{{cite conference |first1= Katsunari |last1= Shibata |first2= Yoichi |last2= Okabe |year=1997 |title= Reinforcement Learning When Visual Sensory Signals are Directly Given as Inputs |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/ICNN97.pdf |conference= International Conference on Neural Networks (ICNN) 1997}}</ref><ref name="Shibata2" /> They employed [[Q-learning]] and actor-critic methods for continuous-motion tasks,<ref name="Shibata4">{{cite conference |first1= Katsunari |last1= Shibata |first2= Masaru |last2= Iida |year=2003 |title= Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/SICE03.pdf |conference= SICE Annual Conference 2003}}</ref> and used a [[recurrent neural network]] for tasks that require memory.<ref name="Shibata5">{{cite conference |first1= Hiroki |last1= Utsunomiya |first2= Katsunari |last2= Shibata |year= 2008 |title= Contextual Behavior and Internal Representations Acquired by Reinforcement Learning with a Recurrent Neural Network in a Continuous State and Action Space Task |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP98Utsunomiya.pdf |conference= International Conference on Neural Information Processing (ICONIP) '08 }}{{Dead link|date=December 2019 |bot=InternetArchiveBot |fix-attempted=yes }}</ref> They applied this framework to several real-robot tasks.<ref name="Shibata4" /><ref name="Shibata6">{{cite conference |first1= Katsunari |last1= Shibata |first2= Tomohiko |last2= Kawano |year=2008 |title= Learning of Action Generation from Raw Camera Images in a Real-World-like Environment by Simple Coupling of Reinforcement Learning and a Neural Network |url= http://shws.cc.oita-u.ac.jp/~shibata/pub/ICONIP98.pdf |conference= International Conference on Neural Information Processing (ICONIP) '08}}</ref> They demonstrated the learning of various functions.

Beginning around 2013, Google DeepMind showed impressive learning results in video games<ref name="DQN1" /><ref name="DQN2" /> and the game of Go ([[AlphaGo]]).<ref name="AlphaGo" /> They used a deep [[convolutional neural network]], an architecture that had shown superior results in image recognition. As input they used four 84×84 frames obtained with minimal preprocessing from the raw screen pixels. The network was trained by RL, with the reward being the sign of the change in the game score. All 49 games were learned with the same network architecture and [[Q-learning]] with minimal prior knowledge; the approach outperformed competing methods on almost all of the games and performed at a level comparable or superior to a professional human game tester.<ref name="DQN2" /> This network is sometimes called the deep Q-network (DQN). In [[AlphaGo]], deep neural networks are trained not only by [[reinforcement learning]] but also by [[supervised learning]], and are combined with [[Monte Carlo tree search]].<ref name="AlphaGo" />
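
The training target described above can be sketched as follows (a simplified illustration; the published agents also relied on experience replay and further stabilization techniques, which are omitted here):

<syntaxhighlight lang="python">
# Sketch of the DQN-style setup described above: rewards are reduced to their
# sign, the input is a stack of four preprocessed 84x84 frames, and the
# network would be updated toward the one-step Q-learning target.
import numpy as np

GAMMA = 0.99  # illustrative discount factor

def clip_reward(score_change):
    """Reward used for learning: only the sign of the change in game score."""
    return float(np.sign(score_change))

def q_learning_target(reward, next_q_values, terminal):
    """One-step target y = r + gamma * max_a' Q(s', a') for non-terminal s'."""
    if terminal:
        return reward
    return reward + GAMMA * float(np.max(next_q_values))

# Usage with made-up numbers:
state = np.zeros((4, 84, 84), dtype=np.float32)   # stack of 4 preprocessed frames
r = clip_reward(score_change=250)                 # -> 1.0
y = q_learning_target(r, next_q_values=np.array([0.2, 0.5, 0.1]), terminal=False)
</syntaxhighlight>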

== Function emergence ==
Shibata's group showed that various functions emerge in this framework, including:<ref name="Shibata2" />

* Image recognition
* Color constancy (optical illusion)
* Sensor motion (active recognition)
* Hand-eye coordination and hand reaching movement
* Explanation of brain activities
* Knowledge transfer
* Memory
* Selective attention
* Prediction
* Exploration

Communication abilities also emerged in this framework. The modes include:<ref name="Shibata7">{{cite arXiv|eprint=1703.03543|first=Katsunari|last=Shibata|title=Communications that Emerge through Reinforcement Learning Using a (Recurrent) Neural Network|date=March 9, 2017|class=cs.AI}}</ref>

* Dynamic communication (negotiation)
* Binarization of signals
* Grounded communication using a real robot and camera

== References ==
<!-- Inline citations added to your article will automatically display here. See https://en.wikipedia.org/wiki/WP:REFB for instructions on how to add citations. -->
{{reflist}}

[[Category:Reinforcement learning]]
[[Category:Machine learning]]
