Exploration server on haptic devices


Learning tactile skills through curious exploration

Internal identifier: 001E32 (Pmc/Curation); previous: 001E31; next: 001E33


Authors: Leo Pape [Switzerland]; Calogero M. Oddo [Italy]; Marco Controzzi [Italy]; Christian Cipriani [Italy]; Alexander Förster [Switzerland]; Maria C. Carrozza [Italy]; Jürgen Schmidhuber [Switzerland]

Source:

RBID: PMC:3401897

Abstract

We present curiosity-driven, autonomous acquisition of tactile exploratory skills on a biomimetic robot finger equipped with an array of microelectromechanical touch sensors. Instead of building tailored algorithms for solving a specific tactile task, we employ a more general curiosity-driven reinforcement learning approach that autonomously learns a set of motor skills in absence of an explicit teacher signal. In this approach, the acquisition of skills is driven by the information content of the sensory input signals relative to a learner that aims at representing sensory inputs using fewer and fewer computational resources. We show that, from initially random exploration of its environment, the robotic system autonomously develops a small set of basic motor skills that lead to different kinds of tactile input. Next, the system learns how to exploit the learned motor skills to solve supervised texture classification tasks. Our approach demonstrates the feasibility of autonomous acquisition of tactile skills on physical robotic platforms through curiosity-driven reinforcement learning, overcomes typical difficulties of engineered solutions for active tactile exploration and underactuated control, and provides a basis for studying developmental learning through intrinsic motivation in robots.


URL:
DOI: 10.3389/fnbot.2012.00006
PubMed: 22837748
PubMed Central: 3401897

Links toward previous steps (curation, corpus...)


Links to Exploration step

PMC:3401897

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Learning tactile skills through curious exploration</title>
<author>
<name sortKey="Pape, Leo" sort="Pape, Leo" uniqKey="Pape L" first="Leo" last="Pape">Leo Pape</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<institution>Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Università della Svizzera Italiana</institution>
<country>Lugano, Switzerland</country>
</nlm:aff>
<country xml:lang="fr">Suisse</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Oddo, Calogero M" sort="Oddo, Calogero M" uniqKey="Oddo C" first="Calogero M." last="Oddo">Calogero M. Oddo</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<institution>The BioRobotics Institute, Scuola Superiore Sant'Anna</institution>
<country>Pisa, Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Controzzi, Marco" sort="Controzzi, Marco" uniqKey="Controzzi M" first="Marco" last="Controzzi">Marco Controzzi</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<institution>The BioRobotics Institute, Scuola Superiore Sant'Anna</institution>
<country>Pisa, Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Cipriani, Christian" sort="Cipriani, Christian" uniqKey="Cipriani C" first="Christian" last="Cipriani">Christian Cipriani</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<institution>The BioRobotics Institute, Scuola Superiore Sant'Anna</institution>
<country>Pisa, Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Forster, Alexander" sort="Forster, Alexander" uniqKey="Forster A" first="Alexander" last="Förster">Alexander Förster</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<institution>Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Università della Svizzera Italiana</institution>
<country>Lugano, Switzerland</country>
</nlm:aff>
<country xml:lang="fr">Suisse</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Carrozza, Maria C" sort="Carrozza, Maria C" uniqKey="Carrozza M" first="Maria C." last="Carrozza">Maria C. Carrozza</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<institution>The BioRobotics Institute, Scuola Superiore Sant'Anna</institution>
<country>Pisa, Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Schmidhuber, Jurgen" sort="Schmidhuber, Jurgen" uniqKey="Schmidhuber J" first="Jürgen" last="Schmidhuber">Jürgen Schmidhuber</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<institution>Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Università della Svizzera Italiana</institution>
<country>Lugano, Switzerland</country>
</nlm:aff>
<country xml:lang="fr">Suisse</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22837748</idno>
<idno type="pmc">3401897</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3401897</idno>
<idno type="RBID">PMC:3401897</idno>
<idno type="doi">10.3389/fnbot.2012.00006</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">001E32</idno>
<idno type="wicri:Area/Pmc/Curation">001E32</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Learning tactile skills through curious exploration</title>
<author>
<name sortKey="Pape, Leo" sort="Pape, Leo" uniqKey="Pape L" first="Leo" last="Pape">Leo Pape</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<institution>Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Università della Svizzera Italiana</institution>
<country>Lugano, Switzerland</country>
</nlm:aff>
<country xml:lang="fr">Suisse</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Oddo, Calogero M" sort="Oddo, Calogero M" uniqKey="Oddo C" first="Calogero M." last="Oddo">Calogero M. Oddo</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<institution>The BioRobotics Institute, Scuola Superiore Sant'Anna</institution>
<country>Pisa, Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Controzzi, Marco" sort="Controzzi, Marco" uniqKey="Controzzi M" first="Marco" last="Controzzi">Marco Controzzi</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<institution>The BioRobotics Institute, Scuola Superiore Sant'Anna</institution>
<country>Pisa, Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Cipriani, Christian" sort="Cipriani, Christian" uniqKey="Cipriani C" first="Christian" last="Cipriani">Christian Cipriani</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<institution>The BioRobotics Institute, Scuola Superiore Sant'Anna</institution>
<country>Pisa, Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Forster, Alexander" sort="Forster, Alexander" uniqKey="Forster A" first="Alexander" last="Förster">Alexander Förster</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<institution>Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Università della Svizzera Italiana</institution>
<country>Lugano, Switzerland</country>
</nlm:aff>
<country xml:lang="fr">Suisse</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Carrozza, Maria C" sort="Carrozza, Maria C" uniqKey="Carrozza M" first="Maria C." last="Carrozza">Maria C. Carrozza</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<institution>The BioRobotics Institute, Scuola Superiore Sant'Anna</institution>
<country>Pisa, Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Schmidhuber, Jurgen" sort="Schmidhuber, Jurgen" uniqKey="Schmidhuber J" first="Jürgen" last="Schmidhuber">Jürgen Schmidhuber</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<institution>Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Università della Svizzera Italiana</institution>
<country>Lugano, Switzerland</country>
</nlm:aff>
<country xml:lang="fr">Suisse</country>
<wicri:regionArea></wicri:regionArea>
<wicri:regionArea># see nlm:aff region in country</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Frontiers in Neurorobotics</title>
<idno type="eISSN">1662-5218</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>We present curiosity-driven, autonomous acquisition of tactile exploratory skills on a biomimetic robot finger equipped with an array of microelectromechanical touch sensors. Instead of building tailored algorithms for solving a specific tactile task, we employ a more general curiosity-driven reinforcement learning approach that autonomously learns a set of motor skills in absence of an explicit teacher signal. In this approach, the acquisition of skills is driven by the information content of the sensory input signals relative to a learner that aims at representing sensory inputs using fewer and fewer computational resources. We show that, from initially random exploration of its environment, the robotic system autonomously develops a small set of basic motor skills that lead to different kinds of tactile input. Next, the system learns how to exploit the learned motor skills to solve supervised texture classification tasks. Our approach demonstrates the feasibility of autonomous acquisition of tactile skills on physical robotic platforms through curiosity-driven reinforcement learning, overcomes typical difficulties of engineered solutions for active tactile exploration and underactuated control, and provides a basis for studying developmental learning through intrinsic motivation in robots.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Arai, H" uniqKey="Arai H">H. Arai</name>
</author>
<author>
<name sortKey="Tachi, S" uniqKey="Tachi S">S. Tachi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bensmaia, S J" uniqKey="Bensmaia S">S. J. Bensmaïa</name>
</author>
<author>
<name sortKey="Hollins, M" uniqKey="Hollins M">M. Hollins</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bensmaia, S J" uniqKey="Bensmaia S">S. J. Bensmaia</name>
</author>
<author>
<name sortKey="Hollins, M" uniqKey="Hollins M">M. Hollins</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blake, D T" uniqKey="Blake D">D. T. Blake</name>
</author>
<author>
<name sortKey="Hsiao, S S" uniqKey="Hsiao S">S. S. Hsiao</name>
</author>
<author>
<name sortKey="Johnson, K O" uniqKey="Johnson K">K. O. Johnson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bourlard, H" uniqKey="Bourlard H">H. Bourlard</name>
</author>
<author>
<name sortKey="Kamp, Y" uniqKey="Kamp Y">Y. Kamp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Buchholz, B" uniqKey="Buchholz B">B. Buchholz</name>
</author>
<author>
<name sortKey="Armstrong, T J" uniqKey="Armstrong T">T. J. Armstrong</name>
</author>
<author>
<name sortKey="Goldstein, S A" uniqKey="Goldstein S">S. A. Goldstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Connor, C E" uniqKey="Connor C">C. E. Connor</name>
</author>
<author>
<name sortKey="Hsiao, S S" uniqKey="Hsiao S">S. S. Hsiao</name>
</author>
<author>
<name sortKey="Phillips, J R" uniqKey="Phillips J">J. R. Phillips</name>
</author>
<author>
<name sortKey="Johnson, K O" uniqKey="Johnson K">K. O. Johnson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Connor, E" uniqKey="Connor E">E. Connor</name>
</author>
<author>
<name sortKey="Johnson, K O" uniqKey="Johnson K">K. O. Johnson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fishel, J A" uniqKey="Fishel J">J. A. Fishel</name>
</author>
<author>
<name sortKey="Loeb, G E" uniqKey="Loeb G">G. E. Loeb</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gardner, E P" uniqKey="Gardner E">E. P. Gardner</name>
</author>
<author>
<name sortKey="Palmer, C I" uniqKey="Palmer C">C. I. Palmer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gordon, G" uniqKey="Gordon G">G. Gordon</name>
</author>
<author>
<name sortKey="Ahissar, E" uniqKey="Ahissar E">E. Ahissar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hollins, M" uniqKey="Hollins M">M. Hollins</name>
</author>
<author>
<name sortKey="Risner, S R" uniqKey="Risner S">S. R. Risner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hollins, M" uniqKey="Hollins M">M. Hollins</name>
</author>
<author>
<name sortKey="Bensmaia, S J" uniqKey="Bensmaia S">S. J. Bensmaia</name>
</author>
<author>
<name sortKey="Washburn, S" uniqKey="Washburn S">S. Washburn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Johansson, R S" uniqKey="Johansson R">R. S. Johansson</name>
</author>
<author>
<name sortKey="Flanagan, J R" uniqKey="Flanagan J">J. R. Flanagan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Johansson, R S" uniqKey="Johansson R">R. S. Johansson</name>
</author>
<author>
<name sortKey="Vallbo, A B" uniqKey="Vallbo A">A. B. Vallbo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jones, L A" uniqKey="Jones L">L. A. Jones</name>
</author>
<author>
<name sortKey="Lederman, S J" uniqKey="Lederman S">S. J. Lederman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kaelbling, L P" uniqKey="Kaelbling L">L. P. Kaelbling</name>
</author>
<author>
<name sortKey="Littman, M L" uniqKey="Littman M">M. L. Littman</name>
</author>
<author>
<name sortKey="Moore, A W" uniqKey="Moore A">A. W. Moore</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kompella, V R" uniqKey="Kompella V">V. R. Kompella</name>
</author>
<author>
<name sortKey="Pape, L" uniqKey="Pape L">L. Pape</name>
</author>
<author>
<name sortKey="Masci, J" uniqKey="Masci J">J. Masci</name>
</author>
<author>
<name sortKey="Frank, M" uniqKey="Frank M">M. Frank</name>
</author>
<author>
<name sortKey="Schmidhuber, J" uniqKey="Schmidhuber J">J. Schmidhuber</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Konidaris, G D" uniqKey="Konidaris G">G. D. Konidaris</name>
</author>
<author>
<name sortKey="Kuindersma, S R" uniqKey="Kuindersma S">S. R. Kuindersma</name>
</author>
<author>
<name sortKey="Grupen, R A" uniqKey="Grupen R">R. A. Grupen</name>
</author>
<author>
<name sortKey="Barto, A G" uniqKey="Barto A">A. G. Barto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lagoudakis, M G" uniqKey="Lagoudakis M">M. G. Lagoudakis</name>
</author>
<author>
<name sortKey="Parr, R" uniqKey="Parr R">R. Parr</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lamotte, R H" uniqKey="Lamotte R">R. H. LaMotte</name>
</author>
<author>
<name sortKey="Srinivasan, M A" uniqKey="Srinivasan M">M. A. Srinivasan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lederman, S J" uniqKey="Lederman S">S. J. Lederman</name>
</author>
<author>
<name sortKey="Klatzky, R L" uniqKey="Klatzky R">R. L. Klatzky</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lloyd, S P" uniqKey="Lloyd S">S. P. Lloyd</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Morley, J W" uniqKey="Morley J">J. W. Morley</name>
</author>
<author>
<name sortKey="Goodwin, A W" uniqKey="Goodwin A">A. W. Goodwin</name>
</author>
<author>
<name sortKey="Darian Smith, I" uniqKey="Darian Smith I">I. Darian-Smith</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mugan, J" uniqKey="Mugan J">J. Mugan</name>
</author>
<author>
<name sortKey="Kuipers, B" uniqKey="Kuipers B">B. Kuipers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oddo, C M" uniqKey="Oddo C">C. M. Oddo</name>
</author>
<author>
<name sortKey="Beccai, L" uniqKey="Beccai L">L. Beccai</name>
</author>
<author>
<name sortKey="Wessberg, J" uniqKey="Wessberg J">J. Wessberg</name>
</author>
<author>
<name sortKey="Backlund Wasling, H" uniqKey="Backlund Wasling H">H. Backlund Wasling</name>
</author>
<author>
<name sortKey="Mattioli, F" uniqKey="Mattioli F">F. Mattioli</name>
</author>
<author>
<name sortKey="Carrozza, M C" uniqKey="Carrozza M">M. C. Carrozza</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oddo, C M" uniqKey="Oddo C">C. M. Oddo</name>
</author>
<author>
<name sortKey="Controzzi, M" uniqKey="Controzzi M">M. Controzzi</name>
</author>
<author>
<name sortKey="Beccai, L" uniqKey="Beccai L">L. Beccai</name>
</author>
<author>
<name sortKey="Cipriani, C" uniqKey="Cipriani C">C. Cipriani</name>
</author>
<author>
<name sortKey="Carrozza, M C" uniqKey="Carrozza M">M. C. Carrozza</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oudeyer, P Y" uniqKey="Oudeyer P">P. Y. Oudeyer</name>
</author>
<author>
<name sortKey="Kaplan, F" uniqKey="Kaplan F">F. Kaplan</name>
</author>
<author>
<name sortKey="Hafner, V" uniqKey="Hafner V">V. Hafner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Prescott, T J" uniqKey="Prescott T">T. J. Prescott</name>
</author>
<author>
<name sortKey="Diamond, M E" uniqKey="Diamond M">M. E. Diamond</name>
</author>
<author>
<name sortKey="Wing, A M" uniqKey="Wing A">A. M. Wing</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Radwin, R" uniqKey="Radwin R">R. Radwin</name>
</author>
<author>
<name sortKey="Jeng, O" uniqKey="Jeng O">O. Jeng</name>
</author>
<author>
<name sortKey="Gisske, E" uniqKey="Gisske E">E. Gisske</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schmidhuber, J" uniqKey="Schmidhuber J">J. Schmidhuber</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schmidhuber, J" uniqKey="Schmidhuber J">J. Schmidhuber</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sutton, R" uniqKey="Sutton R">R. Sutton</name>
</author>
<author>
<name sortKey="Barto, A" uniqKey="Barto A">A. Barto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vigorito, C M" uniqKey="Vigorito C">C. M. Vigorito</name>
</author>
<author>
<name sortKey="Barto, A G" uniqKey="Barto A">A. G. Barto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yoshioka, T" uniqKey="Yoshioka T">T. Yoshioka</name>
</author>
<author>
<name sortKey="Gibb, B" uniqKey="Gibb B">B. Gibb</name>
</author>
<author>
<name sortKey="Dorsch, A K" uniqKey="Dorsch A">A. K. Dorsch</name>
</author>
<author>
<name sortKey="Hsiao, S S" uniqKey="Hsiao S">S. S. Hsiao</name>
</author>
<author>
<name sortKey="Johnson, K O" uniqKey="Johnson K">K. O. Johnson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yoshioka, T" uniqKey="Yoshioka T">T. Yoshioka</name>
</author>
<author>
<name sortKey="Bensmaia, S J" uniqKey="Bensmaia S">S. J. Bensmaia</name>
</author>
<author>
<name sortKey="Craig, J C" uniqKey="Craig J">J. C. Craig</name>
</author>
<author>
<name sortKey="Hsiao, S S" uniqKey="Hsiao S">S. S. Hsiao</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Front Neurorobot</journal-id>
<journal-id journal-id-type="iso-abbrev">Front Neurorobot</journal-id>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title-group>
<journal-title>Frontiers in Neurorobotics</journal-title>
</journal-title-group>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22837748</article-id>
<article-id pub-id-type="pmc">3401897</article-id>
<article-id pub-id-type="doi">10.3389/fnbot.2012.00006</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research Article</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Learning tactile skills through curious exploration</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Pape</surname>
<given-names>Leo</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Oddo</surname>
<given-names>Calogero M.</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Controzzi</surname>
<given-names>Marco</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Cipriani</surname>
<given-names>Christian</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Förster</surname>
<given-names>Alexander</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Carrozza</surname>
<given-names>Maria C.</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Schmidhuber</surname>
<given-names>Jürgen</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Università della Svizzera Italiana</institution>
<country>Lugano, Switzerland</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>The BioRobotics Institute, Scuola Superiore Sant'Anna</institution>
<country>Pisa, Italy</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Robyn Grant, University of Sheffield, UK</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Nathan F. Lepora, University of Sheffield, UK; Benjamin Kuipers, University of Michigan, USA</p>
</fn>
<corresp id="fn001">*Correspondence: Leo Pape, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Università della Svizzera Italiana, Lugano, Switzerland. e-mail:
<email>pape@idsia.ch</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>23</day>
<month>7</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<volume>6</volume>
<elocation-id>6</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>3</month>
<year>2012</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>6</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2012 Pape, Oddo, Controzzi, Cipriani, Förster, Carrozza and Schmidhuber.</copyright-statement>
<copyright-year>2012</copyright-year>
<license license-type="open-access" xlink:href="http://www.frontiersin.org/licenseagreement">
<license-p>This is an open-access article distributed under the terms of the
<uri xlink:type="simple" xlink:href="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution License</uri>
, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.</license-p>
</license>
</permissions>
<abstract>
<p>We present curiosity-driven, autonomous acquisition of tactile exploratory skills on a biomimetic robot finger equipped with an array of microelectromechanical touch sensors. Instead of building tailored algorithms for solving a specific tactile task, we employ a more general curiosity-driven reinforcement learning approach that autonomously learns a set of motor skills in absence of an explicit teacher signal. In this approach, the acquisition of skills is driven by the information content of the sensory input signals relative to a learner that aims at representing sensory inputs using fewer and fewer computational resources. We show that, from initially random exploration of its environment, the robotic system autonomously develops a small set of basic motor skills that lead to different kinds of tactile input. Next, the system learns how to exploit the learned motor skills to solve supervised texture classification tasks. Our approach demonstrates the feasibility of autonomous acquisition of tactile skills on physical robotic platforms through curiosity-driven reinforcement learning, overcomes typical difficulties of engineered solutions for active tactile exploration and underactuated control, and provides a basis for studying developmental learning through intrinsic motivation in robots.</p>
</abstract>
<kwd-group>
<kwd>active learning</kwd>
<kwd>biomimetic robotics</kwd>
<kwd>curiosity</kwd>
<kwd>intrinsic motivation</kwd>
<kwd>reinforcement learning</kwd>
<kwd>skill learning</kwd>
<kwd>tactile sensing</kwd>
</kwd-group>
<counts>
<fig-count count="12"></fig-count>
<table-count count="3"></table-count>
<equation-count count="26"></equation-count>
<ref-count count="36"></ref-count>
<page-count count="16"></page-count>
<word-count count="11377"></word-count>
</counts>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1. Introduction</title>
<p>Complex robots typically require dedicated teams of control engineers that program the robot to execute specific tasks in restricted laboratory settings or other controlled environments. Slight changes in the task requirements or the robot's environment often require extensive re-programming, calibration, and testing to adjust the robot to the changed conditions. The implementation of these tasks could be sped up significantly if the robot autonomously develops and maintains some knowledge about its own capabilities and the structure of the environment in which it lives. Instead of placing the task of supplying the robot with such knowledge in the hands of the robot's creator,
<italic>curious</italic>
robots actively explore their own capabilities and the structure of their environment even
<italic>without an externally specified goal</italic>
. The structure found in the environment and its relation to the robot's own actions during curious exploration could be stored and used later to rapidly solve externally-specified tasks.</p>
<p>A formalization of the idea of curious exploratory behavior is found in the work of Schmidhuber (
<xref ref-type="bibr" rid="B32">2010</xref>
) and references therein. The theory of intrinsically-motivated learning developed in these works considers active machine
<italic>learning agents</italic>
that try to become more efficient in storing and predicting the observations that follow from their actions. A major realization of Schmidhuber (
<xref ref-type="bibr" rid="B32">2010</xref>
) is that curious behavior should not direct the agent toward just any unknown or unexplored part of its environment, but to those parts where it expects to
<italic>learn</italic>
additional patterns or regularities. To this end, the learning agent should keep track of its past learning
<italic>progress</italic>
, and find the relation between this progress and its own behavior. Learned behaviors that lead to certain regular or predictable sensory outcomes can be stored in the form of
<italic>skills</italic>
. Bootstrapping the skills learned in this fashion, the agent can discover novel parts of the environment, learn composite complex skills, and quickly find solutions to externally-specified tasks.</p>
<p>This work presents curiosity-driven, autonomous acquisition of tactile exploratory skills on a biomimetic robot finger equipped with an array of microelectromechanical touch sensors. We show that from active, curiosity-driven exploration of its environment, the robotic system autonomously develops a small set of basic motor skills that lead to different kinds of tactile input. Next, the system learns how to exploit the learned motor skills to solve supervised texture classification tasks. Our approach demonstrates the feasibility of autonomous acquisition of tactile skills on physical robotic platforms through curiosity-driven reinforcement learning, overcomes typical difficulties of engineered solutions for tactile exploration and underactuated control, and provides a basis for studying curiosity-driven developmental learning in robots.</p>
<p>Since both theory and practically-feasible algorithms for curiosity-driven machine learning have been developed only recently, few robotic applications of the curiosity-driven approach have been described in the literature thus far. Initial robotic implementations involving vision-based object-interaction tasks are presented in Kompella et al. (
<xref ref-type="bibr" rid="B18">2011</xref>
). A similar approach has been described by Gordon and Ahissar (
<xref ref-type="bibr" rid="B11">2011</xref>
) for a simulated whisking robot. Examples of alternative approaches to curiosity-driven learning on simple robots or simulators can be found in the work of Oudeyer et al. (
<xref ref-type="bibr" rid="B28">2007</xref>
); Vigorito and Barto (
<xref ref-type="bibr" rid="B34">2010</xref>
); Konidaris et al. (
<xref ref-type="bibr" rid="B19">2011</xref>
); Mugan and Kuipers (
<xref ref-type="bibr" rid="B25">2012</xref>
). None of these works consider curiosity-driven development of tactile skills from active tactile exploration.</p>
<p>The rest of this work is organized as follows: Section 2 presents the curiosity-driven learning algorithm and the tactile robotic platform used in the experiments. Section 3.1 illustrates the operation of the curiosity-driven reinforcement learning algorithm on a simple toy problem. The machine learning approach for tactile skill learning presented here has not been published before, and will be described and compared to other approaches where relevant. Section 3.2 then shows the learning of basic tactile skills on the robotic platform and how these can be exploited in an externally-specified surface classification task. Section 4 discusses the results and the relevance of active learning in active tactile sensing.</p>
</sec>
<sec id="s2">
<title>2. Materials and methods</title>
<sec>
<title>2.1. Curiosity-driven modular reinforcement learning</title>
<sec>
<title>2.1.1. Skill learning</title>
<p>The learning of tactile skills is done here within the framework of reinforcement learning (e.g., Kaelbling et al.,
<xref ref-type="bibr" rid="B17">1996</xref>
). A reinforcement learner (RL) addresses the problem of which
<italic>actions</italic>
to take in which
<italic>states</italic>
in order to maximize its cumulative expected
<italic>reward</italic>
. The RL is not explicitly taught which actions to take, as in supervised machine learning, but must instead explore the environment to discover which actions yield the most cumulative reward. This might involve taking actions that yield less immediate reward than other actions, but lead to higher reward in the long term. When using RLs for robot control, states are typically abstract representations of the robot's sensory inputs, actions drive the robot's actuators, and the rewards represent the desirability of the robot's behavior in particular situations. Learning different skills here is done with a
<italic>modular</italic>
reinforcement learning architecture in which each module has its own reward mechanism, and when executed, produces its own behavior.</p>
<p>Most modular reinforcement learning approaches address the question of how to split up a particular learning task into subtasks, each of which can be learned more easily by a separate module. In the curiosity-driven learning framework presented here, there is no externally-specified task that needs to be solved or divided. Instead, the modules should learn different behaviors based on the structure they discover in the agent's sensory inputs. This is done by reinforcement learning modules that learn behaviors that lead to particular kinds of sensory inputs or events, and then terminate. The different
<italic>kinds</italic>
of sensory events are distinguished by another module, which we here call an abstractor. An abstractor can be any learning algorithm that learns to represent the structure underlying its inputs with a few relevant components. This could, for example, be an adaptive clustering method, an autoencoder (e.g., Bourlard and Kamp,
<xref ref-type="bibr" rid="B5">1988</xref>
), qualitative state representation (Mugan and Kuipers,
<xref ref-type="bibr" rid="B25">2012</xref>
) or (including the time domain) a slow-feature analysis (Kompella et al.,
<xref ref-type="bibr" rid="B18">2011</xref>
). Each component of the abstractor is coupled to an RL module that tries to generate
<italic>stable</italic>
behaviors that lead to sensory inputs with the coupled abstractor state, and then terminates. The resulting modules learn the relation between the part of their sensory inputs that can be directly affected through their own actions, and the abstract structure of their sensory inputs. In other words, the system learns different
<italic>skills</italic>
that specify
<italic>what</italic>
sensory events can occur, and
<italic>how</italic>
to achieve those events. As the behaviors learned by these modules depend on the ability of the system to extract the structure in its sensory input, and not on some externally-provided feedback, we call these modules intrinsically-rewarded skills (inSkills).</p>
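As a concrete but deliberately simplified illustration of this coupling, the sketch below implements an abstractor as a small hand-rolled k-means clustering over windows of raw sensor data; each cluster index stands for one kind of sensory event, and one intrinsically-rewarded module would be attached to each index. This is a hedged Python sketch, not the implementation used in the paper; all names and the choice of clustering method are illustrative.

```python
import numpy as np

# Hedged sketch: an abstractor as hand-rolled k-means over windows of raw
# sensor data. Each cluster index represents one kind of sensory event, and
# one inSkill module would be coupled to each index. Illustrative only.

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

class Abstractor:
    """Maps a window of sensory input to one of k abstract sensory events."""
    def __init__(self, k):
        self.k, self.centers = k, None

    def fit(self, windows):                      # windows: (n_samples, n_features)
        self.centers = kmeans(np.asarray(windows, dtype=float), self.k)

    def event(self, window):                     # index of the nearest component
        d = ((self.centers - np.asarray(window, dtype=float)) ** 2).sum(-1)
        return int(np.argmin(d))

# Module j receives intrinsic reward for reaching sensory inputs whose
# abstractor event equals j, and then terminates.
```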
<p>Apart from inSkills, we also use externally-rewarded skills (exSkills) that are learned through external reward from the environment, and a small number of other modules whose operation will be detailed below. Modules can take two kinds of actions: (1) execute a
<italic>primitive</italic>
that translates directly into an actuator command, and (2) execute another module. When the executed module collects sensory inputs with its corresponding abstractor state, it terminates and returns control to the calling module. The possibility of executing another module as part of a skill allows for cumulative learning of more complex, composite skills. In the initial learning stages, there is not much benefit in selecting another module as an action, as most modules have not yet developed behaviors that reliably lead to different sensory events. However, once the modules become specialized, they may become part of the policy of another module. To prevent modules from calling themselves directly, or indirectly via another module, the RL controller keeps track of the selected modules on a calling stack, and removes the currently executing module and its caller modules from the available action set. In this fashion, only modules that are not already on the calling stack can be selected for execution.</p>
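The calling-stack bookkeeping described above can be sketched as follows. This is a hedged illustration with assumed interfaces (module.select, module.terminated, env.step, env.observe are hypothetical names, not from the paper); it only shows how modules already on the calling stack are excluded from the available actions.

```python
# Hedged sketch of hierarchical module execution. A module chooses among
# primitives and any module not already on the calling stack, which rules out
# direct and indirect recursion. All interfaces are illustrative assumptions.

def available_actions(primitives, modules, calling_stack):
    return list(primitives) + [m for m in modules if m not in calling_stack]

def execute(module, primitives, modules, calling_stack, env, max_steps=100):
    calling_stack.append(module)
    for _ in range(max_steps):
        actions = available_actions(primitives, modules, calling_stack)
        a = module.select(actions)                 # module's current policy
        if a in primitives:
            env.step(a)                            # a primitive actuator command
        else:                                      # another module: run it to termination
            execute(a, primitives, modules, calling_stack, env, max_steps)
        if module.terminated(env.observe()):       # reached its abstractor event
            break
    calling_stack.pop()                            # return control to the caller
```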
<p>It is not uncommon for modular architectures to instantiate additional modules during the learning process. This comes with the disadvantage of having to specify and tune
<italic>ad-hoc</italic>
criteria for module addition and pruning. Instead, we use a learning system with a fixed number of modules, which has to figure out how to assign those modules to the task at hand. Although this system is by definition limited (but so is any physical system), flexibility and cumulative learning are achieved through the hierarchical combination of modules; once the system acquires a new skill, it could use that skill as part of another skill to perform more complex behaviors.</p>
<p>During curious exploration of its environment, the learning agent is driven by a module that tries to improve the reliability of the inSkill behaviors. The idea of using the learning progress of the agent as reward closely follows the work of Schmidhuber (
<xref ref-type="bibr" rid="B32">2010</xref>
). However, the focus here is not so much on the ability of the abstractor to predict or compress any observations, but on finding stable divisions of sensory inputs produced by the agent's behavior into a few components. The intrinsic reward is not just the learning progress of the abstractor, but also includes the improvement in the RL's ability to produce the different sensory events distinguished by the abstractor. In essence, the role of learning progress is taken over here by stability progress, which involves the distribution of the agent's limited computational and physical resources such that the most relevant (relative to the system's learning capabilities) sensory events can be reliably produced. This strong relation between distinct sensor abstractions and the ability to learn behaviors that lead to those abstractions has also been argued for by Mugan and Kuipers (
<xref ref-type="bibr" rid="B25">2012</xref>
).</p>
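One way to picture such a stability-progress reward, under assumptions not spelled out in the paper (window sizes and the exact progress measure are illustrative choices), is as the recent improvement in how reliably a module reaches its own sensory event:

```python
from collections import deque

# Hedged sketch: intrinsic reward as "stability progress", i.e., the recent
# improvement in the success rate with which a module produces its sensory
# event. Window length and clipping at zero are illustrative choices.

class StabilityProgress:
    def __init__(self, window=20):
        self.recent = deque(maxlen=window)   # newest outcomes (1.0 = event reached)
        self.older = deque(maxlen=window)    # outcomes that aged out of `recent`

    def update(self, reached_event):
        if len(self.recent) == self.recent.maxlen:
            self.older.append(self.recent[0])          # value about to be evicted
        self.recent.append(1.0 if reached_event else 0.0)
        if not self.older:
            return 0.0
        progress = (sum(self.recent) / len(self.recent)
                    - sum(self.older) / len(self.older))
        return max(progress, 0.0)                       # reward only improvements
```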
</sec>
<sec>
<title>2.1.2. Adaptive model-based reinforcement learning</title>
<p>Although the abstractors and RLs could in principle be instantiated with a range of different machine learning methods, in practice, especially in robotic practice, few algorithms can be successfully used. The main challenges for curiosity-driven learning on actual hardware are: (1) the algorithms have to learn from much smaller numbers of samples (on the order of 10<sup>2</sup>–10<sup>3</sup>) than are typically assumed to be available in the machine learning literature (often more than 10<sup>4</sup> to solve even the simplest tasks); (2) typical machine learning approaches assume that training samples are generated from a stationary distribution, while the whole purpose of curiosity-driven learning is to make novel parts of the environment and action space available to the robot
<italic>during</italic>
and
<italic>as a result of</italic>
learning; (3) typical reinforcement learning algorithms assume a stationary distribution of the reward, while the intrinsic reward signal in curiosity-driven learning actually decreases as a result of learning the behavior that leads to this reward. None of these challenges is considered solved in the area of practical machine learning; mathematically optimal universal ways of solving them (Schmidhuber,
<xref ref-type="bibr" rid="B31">2006</xref>
) are not practical. In the present work, we employ various machine learning techniques that have been proposed before in the literature, and introduce some new approaches that, to our knowledge, have not been described before. The main criterion for choosing the techniques described below was not their theoretical elegance, efficiency, or even optimality, but their robustness to the challenges addressed above.</p>
<p>To learn effectively from the small number of samples that can be collected from the robotic platform, the learning system trains a Markov model from the collected data, and generates training data for the reinforcement learning algorithm from this model. A Markov model represents the possible states $\mathcal{S}$ of its environment as a set of numbers $s \in \mathcal{S}$. In each state, a number of actions $a \in \mathcal{A}$ are available that lead to (other) states $s' \in \mathcal{S}$ with probability $p(s'|s, a)$. While a primitive action takes one timestep, skills taking several timesteps might also be selected as actions. The model therefore also stores the duration $d(s, a, s')$ of an action in terms of the number of primitive actions.</p>
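A minimal sketch of the bookkeeping such a model needs is shown below; the dictionary layout and method names are illustrative choices, not the authors' data structures.

```python
from collections import defaultdict

# Hedged sketch of the basic model storage described above: transition
# probabilities p(s'|s, a) and action durations d(s, a, s') in primitive steps.

class MarkovModel:
    def __init__(self):
        self.p = defaultdict(dict)   # self.p[(s, a)][s_next] = probability
        self.d = {}                  # self.d[(s, a, s_next)] = duration

    def successors(self, s, a):
        """Possible next states and their probabilities for action a in state s."""
        return list(self.p[(s, a)].items())
```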
<p>The Markov model is further augmented to facilitate learning during the dynamic expansion of the agent's skills and exploration of the environment. For each module $\mathcal{M}_j$ and each transition $(s, a, s')$, the model keeps track of: (1) the short-term reward $r_j(s, a, s')$ provided by the module's reward system; (2) the probability $z_j(s, a, s')$ of terminating the module's policy; (3) the long-term reward $q_j(s, a, s')$ that changes on a slower timescale than the short-term reward. Instead of accumulating the Markov model's learned values over the whole learning history, all model values are updated with a rule that gives more weight to recently-observed values and slowly forgets observations that happened a long time ago:
$$m(s, a, s') \leftarrow (1 - w_*)\, m(s, a, s') + w_*\, v(s, a, s'), \tag{1}$$
with model values $m = \{d, q, r, z\}$, update weights $w_* = \{w_d, w_q, w_r, w_z\}$, and observed values $v = \{d, q, r, z\}$. The short-term rewards, termination probabilities, and transition durations are updated according to Equation 1 for every observation, while the long-term reward is updated for all $q(s, a, s')$ after processing a number of samples equal to the reinforcement learning episode length. Transition probabilities are updated by adding a small constant $w_p$ to $p(s'|s, a)$, and then rescaled such that $\sum_{s' \in \mathcal{S}} p(s'|s, a) = 1$. As the model values adjust to the changing skills, previously learned transitions become less likely. For efficiency reasons we prune model values $d$, $p$, $r$, $z$ for which the transition probabilities have become very small ($p(s'|s, a) < w_o$) after each model update. Together, these update rules ensure that the agent keeps adapting to newly acquired skills and changing dynamics of its expanding environment. Increasing (decreasing) the model parameters $\{w_d, w_o, w_q, w_r, w_z\}$ leads to the development of more flexible (more stable) behaviors.</p>
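Read as code, the update rules above might look like the following sketch. It assumes a probability table shaped like the MarkovModel sketch earlier (a dict of dicts); the default values of w_p and w_o and the renormalization after pruning are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of the forgetting update (Equation 1) and of the transition-
# probability update with pruning. Parameter names follow the text; defaults
# and the extra renormalization after pruning are illustrative.

def update_value(table, key, observed, w):
    """m(s, a, s') <- (1 - w) * m(s, a, s') + w * v(s, a, s')   (Equation 1)."""
    table[key] = (1.0 - w) * table.get(key, 0.0) + w * observed

def update_transition(p, s, a, s_next, w_p=0.05, w_o=0.01):
    """Add w_p to p(s'|s, a), renormalize over s', prune very small entries."""
    row = p[(s, a)]                       # dict: next state -> probability
    row[s_next] = row.get(s_next, 0.0) + w_p
    total = sum(row.values())
    for k in list(row):
        row[k] /= total
        if row[k] < w_o:                  # transition has become very unlikely
            del row[k]
    total = sum(row.values())             # keep the row a proper distribution
    for k in row:
        row[k] /= total
```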
<p>The values stored in the Markov model for each module are used by reinforcement learners to learn policies that maximize the cumulative module rewards. The RLs keep track of how much each state-action pair $(s, a)$ contributes to the cumulative reward $r$ when following the current action-selection policy. In reinforcement learning these state-action values are known as $Q$-values:
$$Q(s, a) = \sum_{t=0}^{\infty} \gamma^t r(t), \tag{2}$$
where $\gamma$ is a discount factor that weights the importance of immediate versus future rewards, and $t$ is time. The RL selects actions $a$ in state $s$ according to its current policy $\pi(s)$:
$$\pi(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi}(s, a). \tag{3}$$
An efficient algorithm for learning those $Q$-values is least-squares policy iteration (LSPI; Lagoudakis and Parr,
<xref ref-type="bibr" rid="B20">2003</xref>
). LSPI represents the estimated $Q$-values as an $\omega$-weighted linear combination of $\kappa$ features of state-action pairs $\phi(s, a)$:
$$\hat{Q}(s, a) = \sum_{j=1}^{\kappa} \phi_j(s, a)\, \omega_j, \tag{4}$$
where $\omega_j$ are the parameters learned by the algorithm. The feature function $\phi(s, a)$ used here represents state-action pairs as binary feature vectors of length $\kappa = |\mathcal{S}||\mathcal{A}|$, with a 1 at the index of the corresponding state-action pair, and a 0 at all other indices. LSPI sweeps through a set of $n$ samples $D = \{(s_i, a_i, s'_i, r_i, p(s'_i|s_i, a_i)) \mid i = 1, \ldots, n\}$ generated from the model, and updates its estimates of parameter vector $\omega$ as:
$$\omega = A^{-1} b \tag{5}$$
$$A = \sum_{i=1}^{n} \left[ \phi(s_i, a_i) \Bigl( \phi(s_i, a_i) - \sum_{s' \in D} \gamma\, p(s'|s_i, a_i)\, \phi(s', \pi(s')) \Bigr)^{T} \right] \tag{6}$$
$$b = \sum_{i=1}^{n} \left[ \phi(s_i, a_i)\, r_i \right]. \tag{7}$$
</mml:math>
</disp-formula>
Because actions with different durations are possible in our implementation, we slightly alter Equations 6 and 7 to take into account the duration
<italic>d</italic>
<sub>
<italic>i</italic>
</sub>
of a transition, both in the discount factor γ and the module reward
<italic>r</italic>
<sub>
<italic>i</italic>
</sub>
:
<disp-formula id="E8">
<label>(8)</label>
<mml:math id="M15">
<mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">A</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black">=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black">=</mml:mo>
<mml:mn mathsize="11pt" mathcolor="black">1</mml:mn>
</mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">n</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:mrow>
<mml:mo mathsize="11pt" mathcolor="black">[</mml:mo>
<mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">ɸ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">a</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo mathsize="11pt" mathcolor="black">(</mml:mo>
<mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">ɸ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">a</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
</mml:msup>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">D</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">γ</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">d</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
</mml:msup>
<mml:mo mathsize="11pt" mathcolor="black">|</mml:mo>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">a</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
<mml:mi mathsize="11pt" mathcolor="black">ɸ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
</mml:msup>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">π</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
</mml:msup>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo mathsize="11pt" mathcolor="black">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">T</mml:mi>
</mml:msup>
</mml:mrow>
<mml:mo mathsize="11pt" mathcolor="black">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E9">
<label>(9)</label>
<mml:math id="M16">
<mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">b</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black">=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black">=</mml:mo>
<mml:mn mathsize="11pt" mathcolor="black">1</mml:mn>
</mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">n</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:mrow>
<mml:mo mathsize="11pt" mathcolor="black">[</mml:mo>
<mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">ɸ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">a</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">γ</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">d</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mn mathsize="11pt" mathcolor="black">1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">r</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo mathsize="11pt" mathcolor="black">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
<mml:mo mathsize="11pt" mathcolor="black">.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
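<p>To make the above concrete, the duration-aware LSPI sweep of Equations 5, 8, and 9 can be summarized in a few lines of code. The following sketch is purely illustrative and not the implementation used in the experiments: it assumes tabular one-hot features ɸ(s, a) of length |S||A|, samples that carry the successor distribution p(s′|s, a), the model reward r and the transition duration d taken from the Markov model, and it solves Equation 5 with a least-squares routine rather than an explicit matrix inverse; all variable names are hypothetical.</p>
<preformat>
import numpy as np

def lspi_sweep(samples, pi, n_states, n_actions, gamma=0.95):
    """One duration-aware LSPI sweep (cf. Equations 5, 8, and 9).

    samples: list of (s, a, next_probs, r, d) tuples, where next_probs is a
             dict {s2: p(s2 | s, a)} from the Markov model, r the model
             reward and d the transition duration.
    pi:      current greedy policy, pi[s] -> action.
    Returns the parameter vector omega of the linear Q-function.
    """
    k = n_states * n_actions

    def phi(s, a):
        f = np.zeros(k)
        f[s * n_actions + a] = 1.0          # one-hot state-action feature
        return f

    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, next_probs, r, d in samples:
        f = phi(s, a)
        # expected successor feature, discounted by the transition duration
        f_next = sum(p * phi(s2, pi[s2]) for s2, p in next_probs.items())
        A += np.outer(f, f - gamma ** d * f_next)     # Equation 8
        b += f * gamma ** (d - 1) * r                 # Equation 9
    return np.linalg.lstsq(A, b, rcond=None)[0]       # Equation 5, A^-1 b
</preformat>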
</sec>
<sec>
<title>2.1.3. Skill types</title>
<p>The curiosity-driven learning agent uses four different types of reinforcement learning modules:</p>
<p>An
<bold>explorer</bold>
module is a naive curiosity module that tries to find novel observations around previous novel observations, but does not exploit any further structure of the environment. The reward mechanism of an explorer uses the Markov model to keep track of the number of times a certain state was visited, and rewards transitions (
<italic>s</italic>
,
<italic>a</italic>
,
<italic>s</italic>
′) such that the reward decreases exponentially with the number of times
<italic>s</italic>
′ was visited:
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
(
<italic>s</italic>
,
<italic>a</italic>
,
<italic>s</italic>
′) =
<italic>e</italic>
<sup>
−<italic>c</italic>
(·,·,
<italic>s</italic>
′)</sup>
, where
<italic>e</italic>
is the natural logarithm base, and
<italic>c</italic>
(·,·,
<italic>s</italic>
′) the number of times a transition led to
<italic>s</italic>
′. This leads to a policy that drives the agent toward yet unexplored parts of the environment, thus speeding up initial exploration.</p>
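<p>As a concrete (and purely illustrative) reading of this reward mechanism, the explorer reward can be computed directly from the visit counts stored in the Markov model; the exponentially decaying form below matches the formula above, and the array name is hypothetical.</p>
<preformat>
import numpy as np

def explorer_reward(visit_counts):
    """Intrinsic explorer reward r_j(s, a, s') = exp(-c(., ., s')).

    visit_counts: for each state s', the number of times any transition
    ended in s'. Rarely visited states yield rewards close to 1,
    frequently visited states yield rewards close to 0.
    """
    return np.exp(-np.asarray(visit_counts, dtype=float))

# example: state 0 was never reached, state 1 was reached five times
print(explorer_reward([0, 5]))   # [1.0, 0.0067...]
</preformat>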
<p>
<bold>InSkill</bold>
modules exploit regularities in the environment to learn behaviors that lead to particular kinds of sensory events. Sensory inputs are grouped in an unsupervised manner by the abstractor into separate abstractor states
<italic>y</italic>
<sub>
<italic>j</italic>
</sub>
. The behavior that leads to each abstractor state
<italic>y</italic>
<sub>
<italic>j</italic>
</sub>
is learned by an individual reinforcement learning module
<inline-formula>
<mml:math id="M17">
<mml:mrow>
<mml:msub>
<mml:mi></mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
. The reward mechanism of these RLs reflects the reliability with which a transition leads to a particular abstractor state. A reward of 1 is given when transition (
<italic>s</italic>
,
<italic>a</italic>
,
<italic>s</italic>
′) leads to
<italic>y</italic>
<sub>
<italic>j</italic>
</sub>
, and a reward of 0 otherwise. In combination with the update rule in Equation 1, this yields high model reward values
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
(
<italic>s</italic>
,
<italic>a</italic>
,
<italic>s</italic>
′) for transitions that reliably lead to the corresponding abstractor state
<italic>y</italic>
<sub>
<italic>j</italic>
</sub>
, and low model reward values otherwise. An inSkill terminates when a transition produces its coupled abstractor state. When a module terminates for other reasons (e.g., reaching the maximum number of allowed timesteps), a failure reward −
<italic>w</italic>
<sub>
<italic>f</italic>
</sub>
(i.e., penalty) is added to
<italic>r</italic>
<sub>
<italic>j</italic>
</sub>
(
<italic>s</italic>
,
<italic>a</italic>
, ·). The reason this penalty is given to all
<italic>s</italic>
′ ∈
<italic>S</italic>
 is that it is unknown to which state the transition would have led if the module had terminated successfully.</p>
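<p>The following sketch illustrates one possible reading of this reward assignment; it is not the original implementation. It assumes that the model rewards r_j(s, a, s′) are kept in a simple lookup table before the running-average update of Equation 1, and the function and variable names are hypothetical.</p>
<preformat>
def inskill_reward(reached_cluster, target_cluster):
    """Reward of a single transition for the inSkill coupled to abstractor
    state `target_cluster`: 1 if the sensory event produced by the
    transition falls into that cluster, 0 otherwise."""
    return 1.0 if reached_cluster == target_cluster else 0.0


def apply_failure_penalty(reward_table, s, a, states, w_f=0.1):
    """When the module terminates unsuccessfully (e.g., timeout), add the
    failure penalty -w_f to r_j(s, a, s') for *all* successor states s',
    because it is unknown where a successful termination would have
    ended."""
    for s_prime in states:
        key = (s, a, s_prime)
        reward_table[key] = reward_table.get(key, 0.0) - w_f
</preformat>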
<p>Note that no direct feedback exists between the ability of the abstractor to separate sensory events and the ability of the RLs to learn behaviors that lead to those events (as is done in some RL approaches). However, there is a behavioral feedback in the sense that the total learning system favors behaviors that lead to those sensory events that can
<italic>reliably</italic>
be distinguished by the abstractor.</p>
<p>The skill
<bold>progressor</bold>
drives the overall behavior of the agent when running in curious exploration mode. The progressor executes those skills that are (re)adapting their expertise. Both an increase and a decrease in the long-term reward collected by a skill imply that it is adjusting toward a more stable policy, so the progressor is rewarded for the absolute
<italic>change</italic>
in long-term reward of the inSkills:
<disp-formula id="E10">
<label>(10)</label>
<mml:math id="M18">
<mml:mrow>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">r</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">a</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo mathsize="11pt" mathcolor="black">=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black">,</mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">a</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black">,</mml:mo>
<mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
</mml:msup>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">S</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black">×</mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">A</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black">×</mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">S</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mstyle class="text" mathsize="11pt">
<mml:mtext>abs</mml:mtext>
</mml:mstyle>
<mml:mrow>
<mml:mo mathsize="11pt" mathcolor="black">(</mml:mo>
<mml:mrow>
<mml:mi mathsize="11pt" mathcolor="black">p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
</mml:msup>
<mml:mo mathsize="11pt" mathcolor="black">|</mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">a</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">Δ</mml:mi>
<mml:msub>
<mml:mi mathsize="11pt" mathcolor="black">q</mml:mi>
<mml:mi mathsize="11pt" mathcolor="black">i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:mi mathsize="11pt" mathcolor="black">a</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
<mml:msup>
<mml:mi mathsize="11pt" mathcolor="black">s</mml:mi>
<mml:mo mathsize="11pt" mathcolor="black"></mml:mo>
</mml:msup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo mathsize="11pt" mathcolor="black">)</mml:mo>
</mml:mrow>
<mml:mo mathsize="11pt" mathcolor="black">.</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</disp-formula>
and uses a fixed reward
<italic>r</italic>
<sub>
<italic>x</italic>
</sub>
for explorer modules.</p>
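<p>Read this way, Equation 10 amounts to weighting the absolute change in a module's long-term reward estimates by the model's transition probabilities. The sketch below is a minimal rendering under that reading, not the code used in the experiments; the dictionaries p and delta_q are hypothetical stand-ins for the Markov model's transition probabilities and the per-transition change in q_i.</p>
<preformat>
def progressor_reward(p, delta_q, r_x=0.1, is_explorer=False):
    """Intrinsic reward of the progressor for executing one module.

    p:        dict {(s, a, s2): p(s2 | s, a)} from the Markov model.
    delta_q:  dict {(s, a, s2): change in the module's long-term reward
              estimate q_i since the previous update}.
    is_explorer: if True, return the fixed exploration reward r_x
              (0.1 in the chain walk experiment, 0.2 on the platform).
    """
    if is_explorer:
        return r_x
    # Equation 10: sum over all transitions of |p(s2|s,a) * delta_q(s,a,s2)|
    return sum(abs(prob * delta_q.get(t, 0.0)) for t, prob in p.items())
</preformat>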
<p>An
<bold>exSkill</bold>
module learns to maximize externally-provided reward. Like the other skill types, exSkills can choose to execute other skills, thus exploiting skills that have been learned through the intrinsic reward system. Reward is given for reaching a designer-specified goal, which then also terminates the module.</p>
<p>Apart from the termination conditions mentioned in the descriptions above, all modules also terminate after a fixed maximum number of actions τ
<sub>
<italic>z</italic>
</sub>
.</p>
<p>All RL modules simultaneously learn from all samples (off-policy). However, modules that execute other modules as part of their own policy learn about the actual behavior (on-policy) of the executed modules. While off-policy learning facilitates rapid learning of all modules in parallel, it also changes a module's behavior without its explicit execution, leading to potentially incorrect policies in modules that select the changed modules as part of their own policy. This issue is resolved as a side-effect of using a progressor, which is rewarded for, and executes, the changing modules, leading to additional sampling of the changed modules until they stabilize.</p>
<p>Apart from the exploration done by the explorer module, a fixed amount of exploration is performed in each module by selecting an untried action from the available action set with probability ε instead of the action with the maximum
<italic>Q</italic>
-value. In case no untried actions are available for exploration, an action is selected with uniform probability from the available action set. Such a policy is called ε-greedy.</p>
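<p>A possible reading of this exploration scheme, which prefers untried actions over purely random ones, is sketched below; the bookkeeping of tried actions is an assumption made for illustration.</p>
<preformat>
import random

def epsilon_greedy(q_values, tried, epsilon=0.1):
    """Select an action for the current state.

    q_values: dict {action: Q-value} for the current state.
    tried:    set of actions already tried in this state.
    With probability epsilon an untried action is selected if one exists,
    otherwise a uniformly random action; with probability 1 - epsilon the
    greedy action is chosen.
    """
    actions = list(q_values)
    if random.random() < epsilon:
        untried = [a for a in actions if a not in tried]
        return random.choice(untried if untried else actions)
    return max(actions, key=lambda a: q_values[a])
</preformat>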
</sec>
</sec>
<sec>
<title>2.2. Robotic platform for tactile skill learning</title>
<p>We use the curiosity-driven machine learning framework to investigate curiosity-driven learning of tactile skills on a robotic platform specifically designed for active tactile exploration. The platform (Figure
<xref ref-type="fig" rid="F1">1A</xref>
) consists of a robotic finger with a tactile sensor in its fingertip, actuation and processing units, and a housing for replaceable blocks with different surfaces. The details of each of those components are given in the remainder of this section.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Pictures of the experimental setup. (A)</bold>
Tactile platform with (1) the robotic finger, (2) actuator modules, (3) sensor processing facilities, and (4) housing for replaceable surface blocks.
<bold>(B)</bold>
Surfaces used in tactile skill learning experiments. From top to bottom: two regular-grated plastic surfaces with 320 and 480 μm spacings, paper, and two denim textiles. The highlighted areas show enlargements of 2 × 2 mm areas.
<bold>(C)</bold>
Fingertip with a 2 × 2 tactile sensor array in the highlighted area, covered with finger-printed packaging material. The ruler shows the scale in cm.
<bold>(D)</bold>
Closeup of 2 MEMS tactile sensor units.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0001"></graphic>
</fig>
<sec>
<title>2.2.1. Biomimetic robotic finger</title>
<p>The human-sized (Buchholz et al.,
<xref ref-type="bibr" rid="B6">1992</xref>
) biomimetic robotic finger used in the active learning experiments is composed of three phalanxes and three flexion joints: a metacarpophalangeal (MCP) joint, a proximal interphalangeal (PIP) joint, and a distal interphalangeal (DIP) joint (see Figure
<xref ref-type="fig" rid="F2">2</xref>
). Unlike in the natural finger, no abduction of the MCP joint is possible, since the task under investigation (i.e., an exploratory trajectory) requires the fingertip to move in two dimensions only. Like the natural finger, the robotic finger is tendon-driven and underactuated; the three joints are actuated by just two motors. Underactuation reduces design complexity and allows self-adaptation and anthropomorphic movements similar to those of human exploratory tasks.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Robotic finger actuation; MCP and combined PIP and DIP (PDIP) are shown separately</bold>
.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0002"></graphic>
</fig>
<p>The finger is driven by two direct-current (DC) motors (model 1727, Faulhaber Minimotor; gear head ratio 14:1). One motor actively actuates the flexion and extension of the MCP joint by means of two lead screw pairs with opposite screw handedness (agonist-antagonist action, Figure
<xref ref-type="fig" rid="F2">2</xref>
). The second motor actuates the flexion of the PIP and DIP underactuated pair (PDIP hereafter) by pulling the tendon. Extension of the PDIP joints during tendon release is achieved passively through torsional springs housed inside the joints. The DC motors are integrated with optical encoders that monitor the released tendon length, enabling position control in motor space. Additionally, tension sensors are integrated in the tendons. Each motor is controlled by a low-level motion controller implementing position, tendon tension, torque (motor current) control, and monitoring. The low-level motion controllers are directly controllable by a host PC through an RS232 serial communication bus.</p>
<p>Due to the underactuated architecture and the absence of joint-angle sensors, the kinematics of the finger are not unique and can only be resolved by considering the dynamics of the robot and its interaction with the environment. Control and monitoring in motor space do not allow for unique control and monitoring in fingertip space, unless the full dynamic model of the finger and its interaction with the touched surface is computed. This makes it difficult to control the finger by means of conventional control strategies (Arai and Tachi,
<xref ref-type="bibr" rid="B1">1991</xref>
).</p>
</sec>
<sec>
<title>2.2.2. Fingertip with MEMS tactile sensor array</title>
<p>The tip of the robotic finger holds a 2 × 2 array of 3D microelectromechanical system (MEMS) tactile sensors (see also Oddo et al.,
<xref ref-type="bibr" rid="B27">2011b</xref>
) created with silicon microstructuring technologies. Each 1.4 mm
<sup>3</sup>
sensor consists of four piezoresistors implanted at the roots of a cross-shaped structure measuring the displacement of the elevated pin (Figure
<xref ref-type="fig" rid="F1">1D</xref>
). The MEMS sensors are placed on a rigid-flex printed circuit board lodged in the fingertip (see Figure
<xref ref-type="fig" rid="F1">1C</xref>
). The resulting array has a density of 72 units/cm
<sup>2</sup>
(i.e., 16 channels/22.3 mm
<sup>2</sup>
), similar to the 70 units/cm
<sup>2</sup>
of human Merkel mechanoreceptors (Johansson and Vallbo,
<xref ref-type="bibr" rid="B15">1979</xref>
).</p>
<p>The piezoresistor output signals are acquired directly (without preamplification) at a frequency of 380 Hz by a 16-channel 24-bit analog-to-digital converter (ADS1258, Texas Instruments) lodged in the distal phalanx. The digital signals acquired from the sensor array via the analog-to-digital converter are encoded as Ethernet packets by C/C++ software routines running on a soft-core processor (Nios II, Altera) instantiated onboard an FPGA (Cyclone II, Altera), and broadcast over an Ethernet connection.</p>
<p>The outer packaging layer of the fingertip (Figure
<xref ref-type="fig" rid="F1">1C</xref>
) is made of synthetic compliant material (DragonSkin, Smooth-On) and has a surface with fingerprints mimicking the human fingerpad (i.e., 400 μm between-ridge distance; fingerprint curvature radius of 4.8 mm in the center of the sensor array; artificial epidermal ridge-height of 170 μm; total packaging thickness of 770 μm; Oddo et al.,
<xref ref-type="bibr" rid="B26">2011a</xref>
).</p>
</sec>
<sec>
<title>2.2.3. Platform</title>
<p>The robotic finger, control modules, and processing hardware are mounted on a platform together with a housing for replaceable surface samples (see Figure
<xref ref-type="fig" rid="F1">1A</xref>
). Five different surfaces are used in the tactile skill learning experiments (see Figure
<xref ref-type="fig" rid="F1">1B</xref>
): two regular-grated plastic blocks with grating-spacings of 320 and 480 μm (labeled ‘grating 320’ and ‘grating 480’, respectively), a paper surface (labeled ‘paper’), and two different denim textiles (labeled ‘fine textile’ and ‘coarse textile’).</p>
<p>The robotic finger and MEMS sensor are handled by separate control and readout modules. To achieve synchronization between finger movements and tactile sensory readouts, we implemented a real-time, combined sensory-motor driver in Java and .NET, which can be easily interfaced from other programming languages.</p>
</sec>
</sec>
</sec>
<sec id="s3">
<title>3. Results</title>
<sec>
<title>3.1. Example: restricted chain walk</title>
<sec>
<title>3.1.1. Setup</title>
<p>We illustrate the relevant aspects of the curiosity-driven learning algorithm with a chain walk problem, an often-used toy-problem in reinforcement learning (e.g., Sutton and Barto,
<xref ref-type="bibr" rid="B33">1998</xref>
). In the chain walk problem considered here, the learning agent is placed in a simulated environment in which it can move left or right between 20 adjacent states. Going left (right) at the left (right) end of the chain leaves the agent in the same state. The structure of the environment is rather obvious when presented in the manner of Figure
<xref ref-type="fig" rid="F3">3</xref>
; however, note that the agent does not know beforehand which actions lead to which states. Instead, it has to
<italic>learn</italic>
the effects of its actions by trying the actions one at a time.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Intrinsic rewards and module policies after 200 learning episodes in the restricted chain walk environment. Top row:</bold>
normalized intrinsic reward for each module as a function of the reinforcement learning state
<italic>s</italic>
′.
<bold>Middle row:</bold>
<italic>Q</italic>
-tables for modules that can select only primitive actions, with
<italic>Q</italic>
-values (grayscale) maximum values (boxes) and the abstractor's cluster boundaries (vertical lines).
<bold>Bottom row:</bold>
<italic>Q</italic>
-tables for modules that can select both primitive actions and inSkills. Black areas in the
<italic>Q</italic>
-tables indicate state-action pairs that were never sampled during learning. Each column of plots shows the results for an individual module.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0003"></graphic>
</fig>
<p>Learning is done over a number of episodes in which the agent always starts in state 1 (left of each column in Figure
<xref ref-type="fig" rid="F3">3</xref>
), and interacts with the environment for a maximum of 25 timesteps. Limiting the chain walk task in this fashion forces the learning agent to address the three machine learning challenges discussed in section 2.1.2: (1) the agent can collect only a limited number of samples before it is sent back to state 1; (2) states cannot be equally sampled, as the agent needs to pass through states closer to state 1 to reach more distant states; (3) larger parts of the input space become available to both the RL and the abstractor as a result of learning, requiring the adjustment of the modules to the increasing input space.</p>
<p>In externally-rewarded chain walk tasks, reaching a particular state or states usually yields a reward. Here, however, we let the agent first explore the chain walk environment without providing any external reward. During this curiosity-driven exploration phase, the sensory input to both the RL and the abstractor is the current state. The abstractor divides the states seen thus far into a number of regions, and the RL modules have to learn policies for reaching each of those regions. In curious-exploration mode the agent thus learns skills for reaching different parts of the environment. These skills can later be used in externally-rewarded tasks where the agent is rewarded for reaching particular states. Instead of retrying all primitive actions starting from state 1 each episode, the agent can then use the learned skills to quickly reach different regions in the environment.</p>
<p>In real-world experiments, many processes affect the learning agent to some extent but cannot all be explicitly represented. The effect of these processes is often referred to as noise. To test the robustness of the learning agent against such noise, we incorporate some random processes in both the environment and the abstractor. To simulate noise in the environment, there is a 10% chance that a primitive action has the reverse effect (going right instead of left, and vice versa), and an additional 10% chance that a primitive action has no effect at all (the agent stays in the same state). Abstractor noise is introduced by feeding a randomly selected state (instead of the current state) as input to the abstractor with a 10% chance every timestep.</p>
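<p>For completeness, the restricted, noisy chain walk can be reproduced in a few lines from the description above (20 states, episodes of 25 steps starting in state 1, 10% reversed actions, 10% ineffective actions, 10% abstractor input noise). The code below is a re-implementation for illustration, not the original experimental code.</p>
<preformat>
import random

N_STATES = 20

def step(state, action):
    """One noisy chain walk transition; action is -1 (left) or +1 (right)."""
    roll = random.random()
    if roll < 0.1:
        action = -action            # 10%: action has the reverse effect
    elif roll < 0.2:
        return state                # additional 10%: action has no effect
    return min(max(state + action, 1), N_STATES)   # walls at both ends

def abstractor_input(state):
    """Sensory input to the abstractor, corrupted with 10% noise."""
    if random.random() < 0.1:
        return random.randint(1, N_STATES)
    return state

# one 25-step episode, always starting in state 1, with a random policy
state = 1
for _ in range(25):
    state = step(state, random.choice([-1, 1]))
</preformat>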
<p>The abstractor used for skill learning is a simple clustering algorithm that equally divides the states seen thus far into
<italic>k</italic>
parts
<italic>y</italic>
<sub>1</sub>
,…,
<italic>y</italic>
<sub>
<italic>k</italic>
</sub>
. The
<italic>k</italic>
corresponding inSkill modules
<inline-formula>
<mml:math id="M19">
<mml:mrow>
<mml:msub>
<mml:mi></mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi></mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
learn policies for reaching each of those parts. Rapid exploration of the environment is facilitated by an explorer module. The overall behavior of the agent is driven by a progressor module, which receives reward for the long-term inSkill change and a reward of
<italic>r</italic>
<sub>
<italic>x</italic>
</sub>
= 0.1 for selecting the explorer module. In this fashion, the progressor switches to the explorer module once the stability progress of the inSkills becomes smaller than 0.1. Each episode starts with the execution of the progressor in state 1. The progressor selects a module to execute, which runs until it terminates by itself or for a maximum of τ
<sub>
<italic>z</italic>
</sub>
steps in the environment. This is repeated until τ
<sub>
<italic>e</italic>
</sub>
environment steps (episode length) are executed. At the end of an episode, the samples collected during that episode are used to update the Markov model and the abstractor. Next, the new reinforcement learning policies are generated for each module from the model. A list of all parameter values used for this experiment is given in Table
<xref ref-type="table" rid="T1">1</xref>
.</p>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>
<bold>Experimental parameters and their values</bold>
.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">
<bold>Symbol</bold>
</th>
<th align="left" rowspan="1" colspan="1">
<bold>Description</bold>
</th>
<th align="left" rowspan="1" colspan="1">
<bold>Chain walk</bold>
</th>
<th align="left" rowspan="1" colspan="1">
<bold>Tactile platform</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<mml:math id="M20">
<mml:mi mathvariant="script">S</mml:mi>
</mml:math>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">state set</td>
<td align="left" rowspan="1" colspan="1">{1,…,20}</td>
<td align="left" rowspan="1" colspan="1">{1,…,36}, see Figure
<xref ref-type="fig" rid="F8">8</xref>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<mml:math id="M21">
<mml:mi mathvariant="script">A</mml:mi>
</mml:math>
</inline-formula>
</td>
<td align="left" rowspan="1" colspan="1">action set</td>
<td align="left" rowspan="1" colspan="1">{left, right}</td>
<td align="left" rowspan="1" colspan="1">see Figure
<xref ref-type="fig" rid="F8">8</xref>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>k</italic>
</td>
<td align="left" rowspan="1" colspan="1">number of clusters/inSkill modules</td>
<td align="left" rowspan="1" colspan="1">{4, 7, 10}</td>
<td align="left" rowspan="1" colspan="1">5</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">reinforcement learning exploration rate</td>
<td align="left" rowspan="1" colspan="1">0.1</td>
<td align="left" rowspan="1" colspan="1">0.1</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">γ</td>
<td align="left" rowspan="1" colspan="1">reinforcement learning discount</td>
<td align="left" rowspan="1" colspan="1">0.95</td>
<td align="left" rowspan="1" colspan="1">0.95</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">κ</td>
<td align="left" rowspan="1" colspan="1">LSPI feature vector length</td>
<td align="left" rowspan="1" colspan="1">20</td>
<td align="left" rowspan="1" colspan="1">36</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>r</italic>
<sub>
<italic>x</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">fixed exploration reward</td>
<td align="left" rowspan="1" colspan="1">0.1</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">τ
<sub>
<italic>e</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">episode length</td>
<td align="left" rowspan="1" colspan="1">25</td>
<td align="left" rowspan="1" colspan="1">25</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">τ
<sub>
<italic>z</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">maximum module timesteps</td>
<td align="left" rowspan="1" colspan="1">25</td>
<td align="left" rowspan="1" colspan="1">20</td>
</tr>
<tr>
<td align="left" colspan="4" rowspan="1">
<italic>Markov model update weights</italic>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>w</italic>
<sub>
<italic>d</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">duration update weight</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>w</italic>
<sub>
<italic>f</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">action failure penalty</td>
<td align="left" rowspan="1" colspan="1">0.1</td>
<td align="left" rowspan="1" colspan="1">0</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>w</italic>
<sub>
<italic>o</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">pruning threshold</td>
<td align="left" rowspan="1" colspan="1">0.01</td>
<td align="left" rowspan="1" colspan="1">0.01</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>w</italic>
<sub>
<italic>p</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">transition probability update weight</td>
<td align="left" rowspan="1" colspan="1">0.33</td>
<td align="left" rowspan="1" colspan="1">0.33</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>w</italic>
<sub>
<italic>q</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">long-term reward update weight</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>w</italic>
<sub>
<italic>r</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">short-term reward update weight</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>w</italic>
<sub>
<italic>z</italic>
</sub>
</td>
<td align="left" rowspan="1" colspan="1">termination probability update weight</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>3.1.2. Skill learning</title>
<p>Figure
<xref ref-type="fig" rid="F4">4</xref>
shows an example of the intrinsic reward, the fastest learning modules, and the changing cluster boundaries of the abstractor during curious exploration with four inSkills. As becomes clear from this figure, the agent starts by learning policies for reaching the first few abstractor states (episodes 1–10, until marker (a)). Once it reaches state 15 at marker (a), the agent spends several episodes (11–30, marker (a)–(b)) adjusting modules 3 and 4, which are the modules that take the agent to the rightmost part of the known environment. At episode 30 (marker (b)), the inSkill modules have stabilized (i.e., all inSkills' intrinsic rewards <0.1), and the agent executes the exploration module. Using the learned skills for further exploration of the environment, the exploration module quickly manages to reach state 17 (episode 31, marker (b)). The abstractor adjusts its cluster distribution to the new observations, and the inSkills have to change their policies for reaching those clusters accordingly. This process is repeated at episode 57 (marker (c)), where the exploration policy is selected and promptly takes the agent to the rightmost state (state 20). Due to the change in the abstractor's distribution, the inSkills change their policies again until their learning progress becomes less than 0.1 (episode 75, marker (d)), and the exploring module takes over. This switching between learning stable behaviors and exploiting the stabilized behaviors to explore all transitions in the environment goes on until the environment is fully explored. After that, the agent continues to explore while the inSkills remain stable, indicating that the limit of the agent's learning capabilities in the environment is reached.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Modular intrinsic reward (top), fastest learning modules (middle) and the abstractor's cluster boundaries (bottom) during 200 learning episodes in the chain walk environment.</bold>
The vertical dotted lines at markers (a–d) indicate distinctive learning events as explained in section 3.1.2.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0004"></graphic>
</fig>
<p>An example of the final RL policies for the four-inSkill chain walk task is plotted in Figure
<xref ref-type="fig" rid="F3">3</xref>
). The top row of this figure shows an equal distribution of the inSkill reward regions over the state space. The second row, showing the modules' policies when only primitive actions are allowed, makes clear that the inSkills correctly learn to go left (right) when they are to the right (left) of the reward regions. The bottom row of Figure
<xref ref-type="fig" rid="F3">3</xref>
shows the modules' policies when both primitives and inSkills can be selected as actions. Note that an inSkill cannot execute itself, the explorer, or the progressor as part of its policy, as indicated by the absence of
<italic>Q</italic>
-values in Figure
<xref ref-type="fig" rid="F3">3</xref>
). During initial exploration, the selection of other modules happens quite often, because the transition probabilities of primitive actions are not yet sampled reliably. Once the transition probabilities are estimated more accurately during subsequent exploration, primitive actions become preferred over executing other modules, because the on-policy ε-greedy behavior of the executed modules is less efficient than the optimal policy. After the whole environment is explored, some modules still select other inSkills in certain states. For example, inSkill 1 selects inSkill 2 for going left in states 8–9 and 11–12, and inSkills 1 and 2 select inSkill 3 for going left in states 17 and 18. Again, this is due to the low number of times those state-action (state-module) pairs are sampled during the 200 learning episodes.</p>
<p>The explorer module (right column in Figure
<xref ref-type="fig" rid="F3">3</xref>
) learns policies for reaching the least-visited parts of the environment. This is still reflected in the exploring module's
<italic>Q</italic>
-values and reward after learning. More reward is obtained in states further away from the starting state 1, and
<italic>Q</italic>
-values increase with distance from the starting state, because more distant states are visited less frequently.</p>
</sec>
<sec>
<title>3.1.3. Skill exploitation</title>
<p>To demonstrate the usefulness of the learned skills in an externally-rewarded chain walk task, we compare an agent with trained inSkills against two other learning agents that have no skill-learning capabilities: (1) an agent with no additional modules and (2) an agent with a naive explorer module only. Note that all agents still use an ε-greedy policy as an additional means of exploration. The externally-rewarded task for the agents is to reach any of the states in the furthest region (states 16–20) while starting from state 1. The main challenge is getting to this region by fast and efficient exploration. Once the reward region is reached for the first time, the RLs can usually extract the right policy instantly from the model. Each episode lasts only 25 timesteps, and each module can also run for a maximum of 25 timesteps (see Table
<xref ref-type="table" rid="T1">1</xref>
). Together with the 20% chance of primitive failure (10% in the opposite direction and 10% no change) this makes the task particularly challenging. Even when the right policy is learned, the RL might not always reach any of the goal states during an episode due to action noise.</p>
<p>All experiments are repeated 500 times, and the results are averaged. Figure
<xref ref-type="fig" rid="F5">5</xref>
shows the average proportion of the total possible reward achieved as a function of the number of primitive actions taken during learning over 40 episodes. As becomes clear from this figure, the agent with no additional modules takes a long time to reach the target region. Eventually, the ε-greedy policy will take this agent to the rewarding states, but on average it takes much longer than the 40 training episodes displayed here. The agent with the explorer module learns to reach the rightmost region much faster, because its explorer module drives it to previously unexplored regions. The agents with previously learned inSkills quickly reach the target region by simply selecting one of the previously learned skills that leads there. Agents with more inSkills collect the reward with fewer training examples because several modules lead to the rewarding region. Due to the difficulty of the task (20% action failure, 10% abstractor noise), it still takes these agents some episodes to reach the target region for the first time (e.g., less than half of the time for the four-inSkill agent during the first episode). However, the agents with previously-learned skills are still much faster in solving the externally-rewarded task than the other agents.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Normalized external reward obtained by different learning agents during training over 40 episodes (1000 primitive actions) in the chain walk task</bold>
.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0005"></graphic>
</fig>
</sec>
</sec>
<sec>
<title>3.2. Curiosity-driven skill learning on the robotic platform</title>
<sec>
<title>3.2.1. Setup</title>
<p>The curiosity-driven learning algorithm is applied to the robotic tactile platform to learn the movements that lead to different kinds of tactile events. Here, tactile events are encoded as the frequency spectra of MEMS sensor readouts during 0.33 s finger movements. We filter the MEMS signals with a high-pass filter with a 0.5 Hz cutoff, because frequencies below this threshold carry no information about the type (or presence) of sensor-surface contact. Additionally, we filter the signals with a 50 Hz notch filter to suppress power line noise. For various reasons (e.g., location relative to the fingerprints, DragonSkin becoming stuck inside the sensor after intensive use, general wear, 50 Hz distortions), some channels of the MEMS sensor gave less consistent readings than others. The spectra of the three best-performing channels, selected by visual inspection of the signals, are used in the following.</p>
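<p>The preprocessing described above can be approximated with standard digital filters. The sketch below uses SciPy and is only an approximation of the pipeline (a 0.5 Hz high-pass filter, a 50 Hz notch filter, and the magnitude spectrum of a 0.33 s window sampled at 380 Hz); the filter orders and the notch quality factor are assumptions, not values taken from the experimental setup.</p>
<preformat>
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

FS = 380.0                               # MEMS sampling frequency in Hz

def preprocess(window):
    """Filter one 0.33 s MEMS channel window and return its magnitude spectrum."""
    b, a = butter(2, 0.5 / (FS / 2), btype="highpass")   # remove content below 0.5 Hz
    x = filtfilt(b, a, window)
    b, a = iirnotch(50.0, Q=30.0, fs=FS)                 # suppress power line noise
    x = filtfilt(b, a, x)
    return np.abs(np.fft.rfft(x))

# example with a synthetic 0.33 s window (about 125 samples)
spectrum = preprocess(np.random.randn(int(0.33 * FS)))
</preformat>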
<p>We expect that at least three different tactile events can be distinguished with the robotic platform: (1) movement without sensor-surface contact, which we call free movement, (2) tapping on a surface, and (3) sliding over a surface. To check our expectations, we programmed the finger to perform 50 repetitions of each of these movements in setups with five different surfaces. Figure
<xref ref-type="fig" rid="F6">6</xref>
shows the frequency spectra of the MEMS signals averaged over 50 scripted free, tapping and sliding movements over the surfaces. Sliding movements generate spectra with a low-frequency peak caused by changes in pressure during sliding, and some additional spectral features at higher frequencies: grating 320 has a slight increase in energy around 55 Hz, grating 480 has a peak around 30 Hz, paper has no additional spectral features, fine textile has a peak around 25 Hz, and coarse textile has a peak around 12 Hz. Movements without sensor surface contact (free) yield an almost flat frequency spectrum, while tapping movements lead to spectra with a low-frequency peak and no other significant spectral features.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Average frequency spectra of MEMS recordings during 0.33 s free, tap and slide finger movements for five different surfaces</bold>
.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0006"></graphic>
</fig>
<p>The frequency spectra are fed to an abstractor, whose task is to cluster similar sensory events and represent them as discrete tactile states. The abstractor used for distinguishing tactile events is a
<italic>k</italic>
-means clustering algorithm (Lloyd,
<xref ref-type="bibr" rid="B23">1982</xref>
) that partitions the spectra into
<italic>k</italic>
clusters
<italic>y</italic>
<sub>1</sub>
,…,
<italic>y</italic>
<sub>
<italic>k</italic>
</sub>
, with
<italic>k</italic>
∈ {3, 4, 5}. Although
<italic>k</italic>
-means clustering is an unsupervised method, it is still possible to calculate its classification accuracy on free, tapping and sliding events by assigning each cluster to the tactile event with the largest number of samples in that cluster. Figure
<xref ref-type="fig" rid="F7">7</xref>
shows the classification accuracies on free, tapping and sliding events for each surface individually. As becomes clear from this figure, the signals generated during free, tapping and sliding movements can be distinguished from each other by the
<italic>k</italic>
-means clusterers with reasonable accuracy (>90%). Using more than three (i.e., the number of different finger movements) clusters helps to better separate the different tactile events, usually because the difference between data generated during different types of sliding movements is larger than the difference between data collected from free and tapping movements.</p>
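<p>The accuracy measure used here can be reproduced by assigning every cluster to the movement type that occurs most often inside it. The sketch below uses scikit-learn, which is an assumption rather than the toolkit used for the experiments, and integer-encoded movement labels.</p>
<preformat>
import numpy as np
from sklearn.cluster import KMeans

def clustering_accuracy(spectra, labels, k):
    """Cluster the spectra with k-means and score each sample by the label
    (e.g., 0 = free, 1 = tap, 2 = slide) that dominates its cluster."""
    spectra = np.asarray(spectra)
    labels = np.asarray(labels, dtype=int)
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(spectra)
    correct = 0
    for c in range(k):
        members = labels[clusters == c]
        if members.size:
            # every sample in the cluster counts as its cluster's majority label
            correct += np.max(np.bincount(members))
    return correct / labels.size
</preformat>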
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>Clustering accuracies on MEMS frequency spectra during 0.33 s free, tap and slide finger movements for five different surfaces</bold>
.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0007"></graphic>
</fig>
<p>The goal of the inSkill modules is to learn behaviors that produce MEMS signals belonging to the corresponding abstractor cluster. To learn finger behaviors, the RLs need some proprioceptive information from the finger and need to be able to execute finger movements. We use a representation that might not be optimal for the learning algorithms, but greatly simplifies the graphical presentation of the tactile skill learning. The RLs are fed with the MCP and combined PDIP motor locations, discretized into six positions for each motor, yielding 36 states in total as depicted in Figure
<xref ref-type="fig" rid="F8">8</xref>
. The RLs can select from a total of eight primitive actions that set the torque of the MCP and the tension of the PDIP as presented in Figure
<xref ref-type="fig" rid="F8">8</xref>
. Each motor primitive lasts a fixed 0.33 s. Unlike the chain walk task, the adjacent states might not be directly reachable from each other, for example, closing the PDIP motor when it is half closed (third or fourth state column in Figure
<xref ref-type="fig" rid="F8">8</xref>
) might fully close it at the end of the transition (left state column in Figure
<xref ref-type="fig" rid="F8">8</xref>
). Note that many aspects of the robot's dynamics, such as the angles of the underactuated PIP and DIP joints, the precise encoder values, finger movement direction and velocity, cable tension in case of sensor-surface contact, etc., are not captured in this representation, and instead need to be absorbed by the Markov model's transition probabilities. More complex representations using more state and action dimensions might facilitate faster learning, but do not lend themselves to an easily understandable presentation of tactile-skill learning.</p>
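<p>A sketch of this deliberately coarse proprioceptive representation is given below; the bin edges and the numeric torque and tension values are placeholders chosen for illustration, the actual action set being the one shown in Figure 8.</p>
<preformat>
def encode_state(mcp_pos, pdip_pos, bins=6):
    """Map normalized MCP and PDIP encoder positions (0..1) to one of the
    36 discrete reinforcement learning states of Figure 8."""
    def bin_index(x):
        return min(int(x * bins), bins - 1)      # clamp 1.0 into the last bin
    return bin_index(mcp_pos) * bins + bin_index(pdip_pos)

# eight primitive actions, each a (MCP torque, PDIP tension) command held
# for 0.33 s; the numbers below are placeholders, not the platform settings
ACTIONS = [(torque, tension)
           for torque in (-1.0, 0.0, 1.0, 2.0)
           for tension in (0.0, 1.0)]
</preformat>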
<fig id="F8" position="float">
<label>Figure 8</label>
<caption>
<p>
<bold>States and actions of the robotic finger during the reinforcement learning tasks.</bold>
Left: 6 × 6 areas in normalized encoder-position space represent the discrete reinforcement learning states. One thousand continuous encoder values (gray dots) obtained from a random policy indicate the finger's movement range in the state space. Right: eight motor actions set the PDIP tension and MCP torque.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0008"></graphic>
</fig>
<p>The episodic learning scheme of the chain walk task is also applied to the robotic platform. At the start of each episode the finger is put in randomly selected encoder positions in the ranges 0.1–0.9 (MCP) and 0.1–0.5 (PDIP), which approximately cover the finger's movement range (see Figure
<xref ref-type="fig" rid="F8">8</xref>
). We use a slightly shorter maximum module runtime (20 timesteps) than in the chain walk task to speed up the experiments. A list of all parameter values used for the robotic platform experiment is given in Table
<xref ref-type="table" rid="T1">1</xref>
.</p>
<p>We run the curiosity-driven learning agent on the robotic platform using five inSkill modules, a naive explorer and a skill progressor. No external reward is provided to the agent yet. To allow for adaptation of the
<italic>k</italic>
-means clusterer during exploration and skill-learning, it is retrained every reinforcement learning episode on a buffer of the last 500 observations. Consistency of the cluster-means between each episode is enforced by initializing the
<italic>k</italic>
-means training algorithm with the most recent cluster-means.</p>
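<p>Retraining the clusterer every episode while keeping its clusters consistent can be achieved by warm-starting k-means from the previous cluster means. The sketch below is illustrative only; it uses scikit-learn (an assumption) and a ring buffer of 500 observations, and assumes the buffer already holds at least k spectra.</p>
<preformat>
from collections import deque

import numpy as np
from sklearn.cluster import KMeans

class EpisodicAbstractor:
    """k-means abstractor retrained after every episode on a buffer of the
    last 500 observations, initialized with the previous cluster means."""

    def __init__(self, k=5, buffer_size=500):
        self.k = k
        self.buffer = deque(maxlen=buffer_size)
        self.centers = None

    def end_of_episode(self, new_spectra):
        self.buffer.extend(new_spectra)
        data = np.asarray(self.buffer)
        # warm-start from the previous means to keep cluster identities stable
        init = self.centers if self.centers is not None else "k-means++"
        n_init = 1 if self.centers is not None else 10
        km = KMeans(n_clusters=self.k, init=init, n_init=n_init).fit(data)
        self.centers = km.cluster_centers_
        return km.labels_
</preformat>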
</sec>
<sec>
<title>3.2.2. Skill learning</title>
<p>Figure
<xref ref-type="fig" rid="F9">9</xref>
shows an example of the intrinsic reward of the inSkills during curious exploration of the robotic platform with the coarse textile. The intrinsic reward generated by the progress of the inSkills decreases over time as the agent learns separate behaviors for generating different tactile events. Unlike in the chain walk task, little switching back and forth between skill learning and exploration occurs. Instead, the agent learns the inSkills without calling the explorer for explicit exploration, because it can easily reach all parts of the environment. After about 65 episodes, the inSkills stabilize, and the exploring module takes over, with a few short exceptions around episodes 75, 85, and 90. The learning progress in the inSkills around these episodes is due to the finger getting stuck (caused by faulty encoder readouts) in a pose where the sensors generated many samples for a single cluster. The curiosity-driven learning algorithm picks this up as a potentially interesting event, and tries to learn behaviors that reliably lead to such an event. However, after resetting the finger at the end of the episode, the encoders return the correct values again, and the learning agent gradually forgets about the deviating event.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption>
<p>
<bold>Intrinsic reward (top) and fastest learning modules (bottom) during 100 learning episodes on the robotic platform with the coarse textile</bold>
.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0009"></graphic>
</fig>
<p>Figure
<xref ref-type="fig" rid="F10">10</xref>
shows an example of the abstractor clusters and corresponding RL policies after 100 episodes of curiosity-driven learning in a setup with the coarse textile. Comparing the cluster-means of the inSkills learned during curious exploration to the frequency spectra of the scripted free, tapping and sliding movements in Figure
<xref ref-type="fig" rid="F6">6</xref>
, it becomes clear that the abstractor has learned a similar division of the MEMS frequency spectra. The almost flat frequency spectrum for inSkill 2 is very similar to the frequency spectrum of the scripted free movements, and the spectrum for inSkill 1 has a low-frequency peak similar to that of the scripted tapping movement. The spectrum of inSkill 5 is most similar to the sliding spectrum of the coarse textile in Figure
<xref ref-type="fig" rid="F6">6</xref>
, but misses the characteristic peak around 12 Hz. The absence of a clear peak for this inSkill is probably due to the combination of several sliding movements at different sensor angles and hence, different sensor-surface speeds, smearing out the spectral peaks over a larger range. The actual behaviors generated by these inSkills and the
<italic>Q</italic>
-tables in Figure
<xref ref-type="fig" rid="F10">10</xref>
indicate that the corresponding finger movements are also learned; the
<italic>Q</italic>
-values of inSkill 2 (free) have almost the same value throughout the state space, with slightly higher values with the finger away from the surface; inSkill 1 (tap) has two distinctive high
<italic>Q</italic>
-values for opening and closing the PDIP joints with the MCP joint halfway closed (middle-left in its
<italic>Q</italic>
-table); inSkill 5 (slide) obtains high
<italic>Q</italic>
-values with the MCP joint almost closed and opening and closing actions close to the surface (bottom-center in its
<italic>Q</italic>
-table).</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption>
<p>
<bold>InSkill sensory clusters and policies after 100 learning episodes in a setup with the coarse textile.</bold>
Top row: frequency-spectra cluster means of each inSkill. Bottom row: normalized maximum
<italic>Q</italic>
-values for each module (grayscale) and best actions (arrows). States and actions are the same as in Figure
<xref ref-type="fig" rid="F8">8</xref>
. Black areas without arrows indicate states that were never sampled during learning.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0010"></graphic>
</fig>
<p>The specialization of inSkills 3 and 4 is less obvious from Figure
<xref ref-type="fig" rid="F10">10</xref>
. However, the actual behavior of the inSkills indicated that inSkill 3 developed into a module that produces slight elastic deformations of the sensor packaging material by opening the finger close to the surface (without actually touching it), whereas inSkill 4 developed behavior that led to similar deformations during closing movements.</p>
<p>In setups with the other four surfaces, the skill repertoire learned by the agent also contains distinct behaviors for free, tapping and sliding movements. Apart from these skills, a range of other consistent behaviors were learned, such as behaviors for breaking sensor-surface contact, behaviors that generate elastic deformation of the packaging material after sensor-surface contact, separate skills for sliding forward and sliding backward, hard and soft tapping, and tapping from different angles.</p>
<p>While the specialization of the skills changes during exploration of the environment, the exploration phase often involves the learning of skills in a particular order; first the agent learns to distinguish between free and tapping movements, and as a result of improving its tapping skills, learns a sliding skill as well. During the learning of reliable tapping, the finger makes many movements with the sensor close to the surface. This leads to the discovery of sliding movements, and the learning of the associated sliding skill. The result of this sequence is visible in Figure
<xref ref-type="fig" rid="F9">9</xref>
, where the sliding skill (inSkill 5) is the last module that is learned (note that it is a coincidence that the final specialization happens for the inSkill with the highest number (5); the order of the clusters is determined randomly).</p>
</sec>
<sec>
<title>3.2.3. Skill exploitation</title>
<p>After autonomous learning of skills on the robotic platform, we test the usefulness of the learned skills in an externally-rewarded surface-classification task. The task for the robotic finger is to determine which surface sample is placed on the platform. Instead of programming the finger to slide over the surface, the agent has to learn which of its movements generate the most distinctive information about the surface sample. We compare the learning agent with previously-learned tapping and sliding skills against a learning agent without such previously-learned skills.</p>
<p>To determine the different surface types in the externally-rewarded task, we compare the frequency spectrum recorded during each finger movement with frequency spectra previously recorded during sliding movements over the different surfaces, during tapping movements, and during movements without sensor-surface contact. An external reward of 1 is provided when the recorded spectrum closely matches the previously recorded spectra of the surface placed on the platform, and an external reward of 0 otherwise. After each correct classification, the module ends and the finger is reset as at the start of an episode. Although the reward function does not directly penalize misclassifications (i.e., the finger can produce misclassifications repeatedly without penalty), the limited amount of time in each trial means that more reward can be obtained if the finger makes correct classifications sooner.</p>
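<p>The external reward can be pictured as a nearest-template comparison in the space of frequency spectra. The sketch below illustrates the idea under simple assumptions (Euclidean distance between spectra and a dictionary of previously recorded template spectra); the names and the distance measure are illustrative, not necessarily those of the actual implementation.</p>
<preformat>
import numpy as np

def external_reward(spectrum, templates, true_surface):
    """Return 1 if the recorded MEMS frequency spectrum is closest to a
    template of the surface currently on the platform, and 0 otherwise.

    spectrum     : 1-D array, spectrum recorded during the finger movement
    templates    : dict mapping labels ('paper', 'coarse_textile', ...,
                   'tap', 'free') to previously recorded template spectra
    true_surface : label of the surface placed on the platform
    """
    distances = {label: np.linalg.norm(spectrum - tmpl)
                 for label, tmpl in templates.items()}
    predicted = min(distances, key=distances.get)
    return 1 if predicted == true_surface else 0
</preformat>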
<p>To give an indication of how difficult it is to distinguish the different surfaces during scripted sliding movements, we provide the frequency spectra recorded during sliding movements as well as during free and tapping movements to a 10-means clusterer, and compute the clustering accuracy as described before. As shown in Figure
<xref ref-type="fig" rid="F11">11</xref>
, the overall accuracy of 92% for distinguishing the different surfaces from each other is not as high as the accuracy of distinguishing between free, tap and slide movements for each surface individually (Figure
<xref ref-type="fig" rid="F7">7</xref>
), but is still well above chance level (14%). Figure
<xref ref-type="fig" rid="F11">11</xref>
further indicates that sliding movements over different surfaces can be accurately distinguished from each other, as well as from tapping and free movements. However, it is more difficult to distinguish sliding movements over paper from tapping movements, probably because the smooth paper surface produces almost no distinctive spectral features (compare also the frequency spectra for sliding over paper and tapping in Figure
<xref ref-type="fig" rid="F7">7</xref>
). Using slightly different numbers of clusters (between 7 and 15) changed the accuracies by only a few percentage points.</p>
<fig id="F11" position="float">
<label>Figure 11</label>
<caption>
<p>
<bold>Confusion matrix for surface-type classification using a 10-means clusterer on frequency spectra recorded during pre-scripted free, tapping and sliding movements.</bold>
Background colors indicate the number of samples assigned to each class. The number and percentage of correctly (incorrectly) classified samples are indicated in black (white) text.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0011"></graphic>
</fig>
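<p>The clustering accuracy reported above can be reproduced in outline as follows. This is a sketch assuming scikit-learn's k-means implementation and a majority-label assignment per cluster; the variable names are illustrative.</p>
<preformat>
import numpy as np
from sklearn.cluster import KMeans

def clustering_accuracy(spectra, labels, n_clusters=10, seed=0):
    """Cluster frequency spectra with k-means and score the result by
    assigning each cluster the majority class label of its members, as
    done for the confusion matrix in Figure 11.

    spectra : array of shape (n_samples, n_bins), one spectrum per movement
    labels  : integer array of shape (n_samples,), the true class of each
              sample (five sliding classes plus tapping and free movements)
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    assignments = km.fit_predict(spectra)
    correct = 0
    for c in range(n_clusters):
        members = labels[assignments == c]
        if members.size:
            correct += np.bincount(members).max()  # majority label counts
    return correct / labels.size
</preformat>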
<p>During the externally-rewarded task, we keep training the skills learned during the curious exploration phase, for two reasons. (1) Skills acquired during autonomous exploration might be useful for quickly solving an externally-specified task, but might not solve it directly. For example, a sliding skill might lead to sensor data that distinguishes a sliding movement from a tapping movement in a particular setup, but might not necessarily be good for distinguishing between different surface types. Still, some kind of sliding movement is probably required to distinguish between surface types, and an existing sliding skill can easily be adjusted to make slightly different movements that are better suited to this. We therefore add the external reward to the inSkill modules that were active when the external reward was received, and adjust the modules' models and policies accordingly. (2) The dynamics of the robotic platform change during operation for various reasons (e.g., cable stretch, changes in ambient temperature and battery levels, general wear). While this might require repeated calibration in traditional approaches, the learning system used here is flexible enough to cope with those changes.</p>
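<p>The crediting of external reward to the active inSkill modules can be sketched with a plain tabular value update. The modules in the paper use their own Markov models and LSPI-based policies, so the update rule and parameter values below are only an illustration of the idea, not the actual method.</p>
<preformat>
import numpy as np

def credit_external_reward(q, s, a, s_next, r_int, r_ext,
                           gamma=0.9, lr=0.1):
    """Fold an external reward into the value update of the inSkill module
    that was active when the reward was received (schematic sketch only).

    q     : array of shape (n_states, n_actions), the active module's values
    r_int : intrinsic (curiosity) reward of the module for this transition
    r_ext : external reward from the surface-classification task
    """
    r_total = r_int + r_ext
    td_target = r_total + gamma * q[s_next].max()
    q[s, a] += lr * (td_target - q[s, a])  # standard tabular TD step
    return q
</preformat>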
<p>Learning in the externally-rewarded task is done over 30 episodes and repeated three times for each of the five surfaces described in section 2.2.3. Figure
<xref ref-type="fig" rid="F12">12</xref>
shows the average reward during training. As shown in this figure, agents that have previously learned inSkills learn to solve the externally-rewarded task much faster than agents without such previously-learned skills. The skills learned during the curious exploration phase are useful for the externally-rewarded task, but often do not solve it directly. Instead, they need to be (slightly) adjusted from skills that distinguish free, tap and slide movements for individual surfaces into skills whose sliding movements distinguish the different surfaces. This skill adjustment is reflected in the increasing reward of the agent with previously-learned inSkills while it is learning to solve the externally-rewarded task.</p>
<fig id="F12" position="float">
<label>Figure 12</label>
<caption>
<p>
<bold>External reward obtained by different learning agents during training over 30 episodes in the surface-classification task</bold>
.</p>
</caption>
<graphic xlink:href="fnbot-06-00006-g0012"></graphic>
</fig>
</sec>
</sec>
</sec>
<sec id="s4">
<title>4. Discussion</title>
<p>We presented a curiosity-driven modular reinforcement learning framework for autonomous learning of tactile skills on a tactile robotic platform. The learning algorithm was able to differentiate distinct
<italic>tactile events</italic>
, while simultaneously learning
<italic>behaviors</italic>
for generating those tactile events. The tactile
<italic>skills</italic>
learned in this fashion allowed for rapid learning of an externally-specified surface classification task. Our results highlight two important aspects of active tactile sensing: (1) exploratory tactile skills can be
<italic>learned</italic>
through intrinsic motivation; and (2) using previously-acquired tactile skills, an agent can learn which exploratory policies yield the most relevant tactile information about a presented surface.</p>
<p>A key aspect of the developmental learning system presented here is the ability to use previously-learned skills for reaching novel parts of the environment, and to combine skills into more complex composed skills. This bootstrapping of skills became apparent in the chain walk task, where modules used other modules to reach parts of the environment when the transition probabilities of primitive actions were not accurately known. Also, during curious exploration of the tactile platform, the agent first learned to move the finger without sensor-surface contact, then learned to tap the finger on the surface, and finally learned the more difficult skill of sliding the finger over the surface while maintaining sensor-surface contact. After learning these skills, the agent kept exploring the environment in search of further things to learn, while maintaining a stable division of the skills learned thus far.</p>
<p>The notion of
<italic>active</italic>
tactile sensing has recently been discussed in Prescott et al. (
<xref ref-type="bibr" rid="B29">2011</xref>
), who considered different interpretations: (1) the energetic interpretation, in which the information-relevant energy flow is from the sensor to the outer world being sensed; (2) the kinetic interpretation, in which the sensor touches rather than is being touched; and (3) the preferred interpretation by Prescott et al. (
<xref ref-type="bibr" rid="B29">2011</xref>
), which considers active sensing systems as purposive and information-seeking, involving control of the sensor apparatus in whatever manner suits the task. Our work fits best with the third interpretation, because tactile information drives both the learning of tactile exploratory skills and the categorization of tactile stimuli. Note, however, that the exploratory dynamics are not directly used as kinaesthetic information provided to the texture classifier, but rather enter the categorization chain through their effect on the sensor outputs. In future work, it would be interesting to study whether and how tactile and kinaesthetic information could be fused for motor control and perceptual purposes during learning of exploratory skills.</p>
<p>A potentially interesting comparison could be made between the use of sensory information during tactile exploration in humans and in the biomimetic robotic setup. In our experiments, the algorithms were able to distinguish between different textures using key spectral features of the sensor output. The human neuronal mechanisms and the contributions of the different types of mechanoreceptors to distinguishing textural details (Yoshioka et al.,
<xref ref-type="bibr" rid="B36">2007</xref>
) are still highly debated. No agreement has been reached about the most informative mechanoreceptors (i.e., among Merkel, Meissner, Ruffini, and Pacini corpuscles) or about the coding strategy (e.g., temporal, spatial, spatiotemporal, intensity) used by humans to represent textural information. Various studies have aimed to demonstrate that the Pacinian system encodes fine textures (Hollins et al.,
<xref ref-type="bibr" rid="B13">2001</xref>
; Bensmaïa and Hollins,
<xref ref-type="bibr" rid="B2">2003</xref>
; Bensmaia and Hollins,
<xref ref-type="bibr" rid="B3">2005</xref>
). In particular, Hollins and Risner (
<xref ref-type="bibr" rid="B12">2000</xref>
) supported Katz's duplex theory, according to which texture perception is mediated by different classes of mechanoreceptors, via vibrational cues for fine forms and via spatial cues for coarse forms. Conversely, Johnson and colleagues presented human psychophysical studies and complementary electrophysiological results with monkeys supporting a unified peripheral neural mechanism for roughness encoding of both coarse and fine stimuli, based on the spatial variation in the firing rate of Slowly Adapting type I afferents (SAI; Merkel) (Connor et al.,
<xref ref-type="bibr" rid="B7">1990</xref>
; Connor and Johnson,
<xref ref-type="bibr" rid="B8">1992</xref>
; Blake et al.,
<xref ref-type="bibr" rid="B4">1997</xref>
; Yoshioka et al.,
<xref ref-type="bibr" rid="B35">2001</xref>
). Johansson and Flanagan (
<xref ref-type="bibr" rid="B14">2009</xref>
) introduced a hypothetical model of tactile coding based on coincidence detection of neural events, which may describe the neuronal mechanism along the human somatosensory chain from the tactile receptors, through the cuneate neurons, up to the somatosensory cortex. What has been agreed upon is that humans can detect textures down to the microscale (LaMotte and Srinivasan,
<xref ref-type="bibr" rid="B21">1991</xref>
), and that human texture perception is severely degraded when tangential motion between the fingertip and the tactile stimuli is lacking (Morley et al.,
<xref ref-type="bibr" rid="B24">1983</xref>
; Gardner and Palmer,
<xref ref-type="bibr" rid="B10">1989</xref>
; Radwin et al.,
<xref ref-type="bibr" rid="B30">1993</xref>
; Jones and Lederman,
<xref ref-type="bibr" rid="B16">2006</xref>
). This consolidated finding fits well with the results presented in the current work: like human beings, the robotic finger also developed skills for sliding motions tangential to the tactile stimuli while seeking information-rich experiences.</p>
<p>Recently, Fishel and Loeb (
<xref ref-type="bibr" rid="B9">2012</xref>
) obtained impressive texture classification accuracies on a large range of different textures, using an algorithm that selects the most discriminative exploratory motions from a set of tangential sliding movements with different forces and velocities.
<italic>That</italic>
variations in high-level motion parameters like force and velocity are important for obtaining the most distinctive information is, however, not inferred by their learning algorithm. Our approach first learns how to make exploratory movements without any teacher feedback and without any knowledge of high-level parameters such as sensor-surface force and velocity. As in Fishel and Loeb (
<xref ref-type="bibr" rid="B9">2012</xref>
), our algorithm then learns to select exploratory movements that yield the most distinctive information about the presented textures. Whereas Fishel and Loeb (
<xref ref-type="bibr" rid="B9">2012</xref>
) learn to select pre-scripted exploratory movements, our algorithm can still refine the previously-learned exploratory movements during the learning of the supervised texture classification task.</p>
<p>A further comparison could be made between the exploratory behaviors learned by the biomimetic platform and the learning of tactile exploratory procedures by human beings. There is a large body of literature on the exploratory procedures employed by humans when investigating an object's properties, including its texture (e.g., Lederman and Klatzky,
<xref ref-type="bibr" rid="B22">2009</xref>
, and references therein). Tapping and sliding tangentially over a surface are both used by human beings and learned by the robotic platform when gathering tactile information. Apart from using or selecting
<italic>existing</italic>
exploratory procedures, it could also be interesting to study similarities in how these exploratory procedures are
<italic>learned</italic>
in human beings in the first place. The constraints of the biomimetic robotic finger make tapping easier to learn than sliding. Consequently, sliding is often learned after, and as a result of, tapping. Similar constraints in human beings might lead to the same developmental trend (from tapping to sliding).</p>
<p>Although the robotic finger has just two controllable degrees of freedom, learning skills in an autonomous fashion already proved to be beneficial when learning an additional externally-specified task. Moreover, the learning approach made it possible to overcome challenges of traditional engineered solutions to robotic control, such as the need for constant recalibration of the robotic platform to changing circumstances and the absence of joint-angle sensors in the underactuated joints. We expect that autonomous skill acquisition will become increasingly important for robots with more degrees of freedom and richer sensory capabilities.</p>
<sec>
<title>Conflict of interest statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ack>
<p>This work was funded by the EU under projects NanoBioTouch (FP7-NMP-228844), IM-CLeVeR (FP7-ICT-231722), NanoBioTact (FP6-NMP-033287), and SmartHand (FP6-NMP4-CT2006-33423). Leo Pape acknowledges his colleagues at IDSIA for the many discussions about curiosity-driven learning in theory and practice. Calogero M. Oddo and Maria C. Carrozza acknowledge Dr. Lucia Beccai from CMBR IIT@SSSA for previous collaboration in the development of the biomimetic fingertip, and Prof. Johan Wessberg from the Department of Physiology of University of Gothenburg for meaningful discussions on the human somatosensory system. The authors thank the two reviewers for their constructive comments.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Arai</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Tachi</surname>
<given-names>S.</given-names>
</name>
</person-group>
(
<year>1991</year>
).
<article-title>Position control of a manipulator with passive joints using dynamic coupling</article-title>
.
<source>IEEE Trans. Rob. Autom</source>
.
<volume>7</volume>
,
<fpage>528</fpage>
<lpage>534</lpage>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bensmaïa</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Hollins</surname>
<given-names>M.</given-names>
</name>
</person-group>
(
<year>2003</year>
).
<article-title>The vibrations of texture</article-title>
.
<source>Somatosens. Mot. Res</source>
.
<volume>20</volume>
,
<fpage>33</fpage>
<lpage>43</lpage>
<pub-id pub-id-type="doi">10.1080/0899022031000083825</pub-id>
<pub-id pub-id-type="pmid">12745443</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bensmaia</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Hollins</surname>
<given-names>M.</given-names>
</name>
</person-group>
(
<year>2005</year>
).
<article-title>Pacinian representations of fine surface texture</article-title>
.
<source>Percept. Psychophys</source>
.
<volume>67</volume>
,
<fpage>842</fpage>
<lpage>854</lpage>
<pub-id pub-id-type="pmid">16334056</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blake</surname>
<given-names>D. T.</given-names>
</name>
<name>
<surname>Hsiao</surname>
<given-names>S. S.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>K. O.</given-names>
</name>
</person-group>
(
<year>1997</year>
).
<article-title>Neural coding mechanisms in tactile pattern recognition; the relative contributions of slowly and rapidly adapting mechanoreceptors to perceived roughness</article-title>
.
<source>J. Neurosci</source>
.
<volume>17</volume>
,
<fpage>7480</fpage>
<lpage>7489</lpage>
<pub-id pub-id-type="pmid">9295394</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bourlard</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Kamp</surname>
<given-names>Y.</given-names>
</name>
</person-group>
(
<year>1988</year>
).
<article-title>Auto-association by multilayer perceptrons and singular value decomposition</article-title>
.
<source>Biol. Cybern</source>
.
<volume>59</volume>
,
<fpage>291</fpage>
<lpage>294</lpage>
<pub-id pub-id-type="pmid">3196773</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buchholz</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Armstrong</surname>
<given-names>T. J.</given-names>
</name>
<name>
<surname>Goldstein</surname>
<given-names>S. A.</given-names>
</name>
</person-group>
(
<year>1992</year>
).
<article-title>Anthropometric data for describing the kinematics of the human hand</article-title>
.
<source>Ergonomics</source>
<volume>35</volume>
,
<fpage>261</fpage>
<lpage>273</lpage>
<pub-id pub-id-type="doi">10.1080/00140139208967812</pub-id>
<pub-id pub-id-type="pmid">1572336</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Connor</surname>
<given-names>C. E.</given-names>
</name>
<name>
<surname>Hsiao</surname>
<given-names>S. S.</given-names>
</name>
<name>
<surname>Phillips</surname>
<given-names>J. R.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>K. O.</given-names>
</name>
</person-group>
(
<year>1990</year>
).
<article-title>Tactile roughness: neural codes that account for psychophysical magnitude estimates</article-title>
.
<source>J. Neurosci</source>
.
<volume>10</volume>
,
<fpage>3823</fpage>
<lpage>3826</lpage>
<pub-id pub-id-type="pmid">2269886</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Connor</surname>
<given-names>C. E.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>K. O.</given-names>
</name>
</person-group>
(
<year>1992</year>
).
<article-title>Neural coding of tactile texture: comparison of spatial and temporal mechanisms for roughness perception</article-title>
.
<source>J. Neurosci</source>
.
<volume>12</volume>
,
<fpage>3414</fpage>
<lpage>3426</lpage>
<pub-id pub-id-type="pmid">1527586</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fishel</surname>
<given-names>J. A.</given-names>
</name>
<name>
<surname>Loeb</surname>
<given-names>G. E.</given-names>
</name>
</person-group>
(
<year>2012</year>
).
<article-title>Bayesian exploration for intelligent identification of textures</article-title>
.
<source>Front. Neurorobot</source>
.
<volume>6</volume>
:
<issue>4</issue>
<pub-id pub-id-type="doi">10.3389/fnbot.2012.00004</pub-id>
<pub-id pub-id-type="pmid">22783186</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gardner</surname>
<given-names>E. P.</given-names>
</name>
<name>
<surname>Palmer</surname>
<given-names>C. I.</given-names>
</name>
</person-group>
(
<year>1989</year>
).
<article-title>Simulation of motion of the skin. I. Receptive fields and temporal frequency coding by cutaneous mechanoreceptors of OPTACON pulses delivered to the hand</article-title>
.
<source>J. Neurophysiol</source>
.
<volume>62</volume>
,
<fpage>1410</fpage>
<lpage>1436</lpage>
<pub-id pub-id-type="pmid">2600632</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Gordon</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Ahissar</surname>
<given-names>E.</given-names>
</name>
</person-group>
(
<year>2011</year>
).
<article-title>Reinforcement active learning hierarchical loops</article-title>
, in
<source>International Joint Conference of Neural Networks (IJCNN)</source>
, (
<publisher-loc>San Jose, CA</publisher-loc>
).</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hollins</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Risner</surname>
<given-names>S. R.</given-names>
</name>
</person-group>
(
<year>2000</year>
).
<article-title>Evidence for the duplex theory of tactile texture perception</article-title>
.
<source>Percept. Psychophys</source>
.
<volume>62</volume>
,
<fpage>695</fpage>
<lpage>705</lpage>
<pub-id pub-id-type="pmid">10883578</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hollins</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bensmaia</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Washburn</surname>
<given-names>S.</given-names>
</name>
</person-group>
(
<year>2001</year>
).
<article-title>Vibrotactile adaptation impairs discrimination of fine, but not coarse, textures</article-title>
.
<source>Somatosens. Mot. Res</source>
.
<volume>18</volume>
,
<fpage>253</fpage>
<lpage>262</lpage>
<pub-id pub-id-type="pmid">11794728</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Johansson</surname>
<given-names>R. S.</given-names>
</name>
<name>
<surname>Flanagan</surname>
<given-names>J. R.</given-names>
</name>
</person-group>
(
<year>2009</year>
).
<article-title>Coding and use of tactile signals from the fingertips in object manipulation tasks</article-title>
.
<source>Nat. Rev. Neurosci</source>
.
<volume>10</volume>
,
<fpage>345</fpage>
<lpage>359</lpage>
<pub-id pub-id-type="doi">10.1038/nrn2621</pub-id>
<pub-id pub-id-type="pmid">19352402</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Johansson</surname>
<given-names>R. S.</given-names>
</name>
<name>
<surname>Vallbo</surname>
<given-names>A. B.</given-names>
</name>
</person-group>
(
<year>1979</year>
).
<article-title>Tactile sensibility in the human hand: relative and absolute densities of four types of mechanoreceptive units in glabrous skin</article-title>
.
<source>J. Physiol</source>
.
<volume>286</volume>
,
<fpage>283</fpage>
<lpage>300</lpage>
<pub-id pub-id-type="pmid">439026</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Jones</surname>
<given-names>L. A.</given-names>
</name>
<name>
<surname>Lederman</surname>
<given-names>S. J.</given-names>
</name>
</person-group>
(
<year>2006</year>
).
<article-title>Tactile sensing</article-title>
, in
<source>Human Hand Function</source>
, eds
<person-group person-group-type="editor">
<name>
<surname>Jones</surname>
<given-names>L. A.</given-names>
</name>
<name>
<surname>Lederman</surname>
<given-names>S. J.</given-names>
</name>
</person-group>
(
<publisher-loc>New York, NY</publisher-loc>
:
<publisher-name>Oxford University Press</publisher-name>
),
<fpage>44</fpage>
<lpage>74</lpage>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kaelbling</surname>
<given-names>L. P.</given-names>
</name>
<name>
<surname>Littman</surname>
<given-names>M. L.</given-names>
</name>
<name>
<surname>Moore</surname>
<given-names>A. W.</given-names>
</name>
</person-group>
(
<year>1996</year>
).
<article-title>Reinforcement learning: a survey</article-title>
.
<source>J. AI Res</source>
.
<volume>4</volume>
,
<fpage>237</fpage>
<lpage>285</lpage>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kompella</surname>
<given-names>V. R.</given-names>
</name>
<name>
<surname>Pape</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Masci</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Frank</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Schmidhuber</surname>
<given-names>J.</given-names>
</name>
</person-group>
(
<year>2011</year>
).
<article-title>AutoIncSFA and vision-based developmental learning for humanoid robots</article-title>
, in
<source>IEEE-RAS International Conference on Humanoid Robots</source>
, (
<publisher-loc>Bled, Slovenia</publisher-loc>
).</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Konidaris</surname>
<given-names>G. D.</given-names>
</name>
<name>
<surname>Kuindersma</surname>
<given-names>S. R.</given-names>
</name>
<name>
<surname>Grupen</surname>
<given-names>R. A.</given-names>
</name>
<name>
<surname>Barto</surname>
<given-names>A. G.</given-names>
</name>
</person-group>
(
<year>2011</year>
).
<article-title>Autonomous skill acquisition on a mobile manipulator</article-title>
, in
<source>Proceedings of the Twenty-Fifth Conference on Artificial Intelligence (AAAI-11)</source>
, (
<publisher-loc>San Francisco, CA</publisher-loc>
).</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lagoudakis</surname>
<given-names>M. G.</given-names>
</name>
<name>
<surname>Parr</surname>
<given-names>R.</given-names>
</name>
</person-group>
(
<year>2003</year>
).
<article-title>Least-squares policy iteration</article-title>
.
<source>J. Mach. Learn. Res</source>
.
<volume>4</volume>
,
<fpage>1107</fpage>
<lpage>1149</lpage>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>LaMotte</surname>
<given-names>R. H.</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>M. A.</given-names>
</name>
</person-group>
(
<year>1991</year>
).
<article-title>Surface microgeometry: neural encoding and perception. Information processing in the somatosensory system</article-title>
, in
<source>Wenner-Gren International. Symposium Series</source>
, eds
<person-group person-group-type="editor">
<name>
<surname>Franzen</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Westman</surname>
<given-names>J.</given-names>
</name>
</person-group>
(
<publisher-loc>New York, NY</publisher-loc>
:
<publisher-name>Macmillan Press</publisher-name>
),
<fpage>49</fpage>
<lpage>58</lpage>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lederman</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Klatzky</surname>
<given-names>R. L.</given-names>
</name>
</person-group>
(
<year>2009</year>
).
<article-title>Haptic perception: a tutorial</article-title>
.
<source>Atten. Percept. Psychophys</source>
.
<volume>71</volume>
,
<fpage>1439</fpage>
<lpage>1459</lpage>
<pub-id pub-id-type="doi">10.3758/APP.71.7.1439</pub-id>
<pub-id pub-id-type="pmid">19801605</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lloyd</surname>
<given-names>S. P.</given-names>
</name>
</person-group>
(
<year>1982</year>
).
<article-title>Least squares quantization in PCM</article-title>
.
<source>IEEE Trans. Inf. Theory</source>
<volume>28</volume>
,
<fpage>129</fpage>
<lpage>137</lpage>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morley</surname>
<given-names>J. W.</given-names>
</name>
<name>
<surname>Goodwin</surname>
<given-names>A. W.</given-names>
</name>
<name>
<surname>Darian-Smith</surname>
<given-names>I.</given-names>
</name>
</person-group>
(
<year>1983</year>
).
<article-title>Tactile discrimination of gratings</article-title>
.
<source>Exp. Brain Res</source>
.
<volume>49</volume>
,
<fpage>291</fpage>
<lpage>299</lpage>
<pub-id pub-id-type="pmid">6832261</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mugan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kuipers</surname>
<given-names>B.</given-names>
</name>
</person-group>
(
<year>2012</year>
).
<article-title>Autonomous learning of high-level states and actions in continuous environments</article-title>
.
<source>IEEE Trans. Auton. Ment. Dev</source>
.
<volume>4</volume>
,
<fpage>70</fpage>
<lpage>86</lpage>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oddo</surname>
<given-names>C. M.</given-names>
</name>
<name>
<surname>Beccai</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wessberg</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Backlund Wasling</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Mattioli</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Carrozza</surname>
<given-names>M. C.</given-names>
</name>
</person-group>
(
<year>2011a</year>
).
<article-title>Roughness encoding in human and biomimetic artificial touch: spatiotemporal frequency modulation and structural anisotropy of fingerprints</article-title>
.
<source>Sensors</source>
<volume>11</volume>
,
<fpage>5596</fpage>
<lpage>5615</lpage>
<pub-id pub-id-type="doi">10.3390/s110605596</pub-id>
<pub-id pub-id-type="pmid">22163915</pub-id>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oddo</surname>
<given-names>C. M.</given-names>
</name>
<name>
<surname>Controzzi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Beccai</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Cipriani</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Carrozza</surname>
<given-names>M. C.</given-names>
</name>
</person-group>
(
<year>2011b</year>
).
<article-title>Roughness encoding for discrimination of surfaces in artificial active touch</article-title>
.
<source>IEEE Trans. Rob</source>
.
<volume>27</volume>
,
<fpage>522</fpage>
<lpage>533</lpage>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oudeyer</surname>
<given-names>P. Y.</given-names>
</name>
<name>
<surname>Kaplan</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Hafner</surname>
<given-names>V.</given-names>
</name>
</person-group>
(
<year>2007</year>
).
<article-title>Intrinsic motivation systems for autonomous mental development</article-title>
.
<source>IEEE Trans. Evol. Comput</source>
.
<volume>11</volume>
,
<fpage>265</fpage>
<lpage>286</lpage>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Prescott</surname>
<given-names>T. J.</given-names>
</name>
<name>
<surname>Diamond</surname>
<given-names>M. E.</given-names>
</name>
<name>
<surname>Wing</surname>
<given-names>A. M.</given-names>
</name>
</person-group>
(
<year>2011</year>
).
<article-title>Active touch sensing</article-title>
.
<source>Philos. Trans. R. Soc. Lond. B Biol. Sci</source>
.
<volume>366</volume>
,
<fpage>2989</fpage>
<lpage>2995</lpage>
<pub-id pub-id-type="doi">10.1098/rstb.2011.0167</pub-id>
<pub-id pub-id-type="pmid">21969680</pub-id>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Radwin</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Jeng</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Gisske</surname>
<given-names>E.</given-names>
</name>
</person-group>
(
<year>1993</year>
).
<article-title>A new automated tactility test instrument for evaluating hand sensory function</article-title>
.
<source>IEEE Trans. Rehabil. Eng</source>
.
<volume>1</volume>
,
<fpage>220</fpage>
<lpage>225</lpage>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schmidhuber</surname>
<given-names>J.</given-names>
</name>
</person-group>
(
<year>2006</year>
).
<article-title>Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts</article-title>
.
<source>Connect. Sci</source>
.
<volume>18</volume>
,
<fpage>173</fpage>
<lpage>187</lpage>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schmidhuber</surname>
<given-names>J.</given-names>
</name>
</person-group>
(
<year>2010</year>
).
<article-title>Formal theory of creativity, fun, and intrinsic motivation (1990–2010)</article-title>
.
<source>IEEE Trans. Auton. Ment. Dev</source>
.
<volume>2</volume>
,
<fpage>230</fpage>
<lpage>247</lpage>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Sutton</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Barto</surname>
<given-names>A.</given-names>
</name>
</person-group>
(
<year>1998</year>
).
<source>Reinforcement Learning: An Introduction</source>
.
<publisher-loc>Cambridge, MA</publisher-loc>
:
<publisher-name>MIT Press</publisher-name>
<pub-id pub-id-type="doi">10.1016/j.neunet.2008.09.004</pub-id>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vigorito</surname>
<given-names>C. M.</given-names>
</name>
<name>
<surname>Barto</surname>
<given-names>A. G.</given-names>
</name>
</person-group>
(
<year>2010</year>
).
<article-title>Intrinsically motivated hierarchical skill learning in structured environments</article-title>
.
<source>IEEE Trans. Auton. Ment. Dev</source>
.
<volume>2</volume>
,
<fpage>132</fpage>
<lpage>143</lpage>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yoshioka</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Gibb</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Dorsch</surname>
<given-names>A. K.</given-names>
</name>
<name>
<surname>Hsiao</surname>
<given-names>S. S.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>K. O.</given-names>
</name>
</person-group>
(
<year>2001</year>
).
<article-title>Neural coding mechanisms underlying perceived roughness of finely textured surfaces</article-title>
.
<source>J. Neurosci</source>
.
<volume>21</volume>
,
<fpage>6905</fpage>
<lpage>6916</lpage>
<pub-id pub-id-type="pmid">11517278</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yoshioka</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Bensmaia</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Craig</surname>
<given-names>J. C.</given-names>
</name>
<name>
<surname>Hsiao</surname>
<given-names>S. S.</given-names>
</name>
</person-group>
(
<year>2007</year>
).
<article-title>Texture perception through direct and indirect touch: an analysis of perceptual space for tactile textures in two modes of exploration</article-title>
.
<source>Somatosens. Mot. Res</source>
.
<volume>24</volume>
,
<fpage>53</fpage>
<lpage>70</lpage>
<pub-id pub-id-type="doi">10.1080/08990220701318163</pub-id>
<pub-id pub-id-type="pmid">17558923</pub-id>
</mixed-citation>
</ref>
</ref-list>
<glossary>
<def-list>
<title>Glossary of symbols.</title>
<def-item>
<term>
<inline-formula>
<mml:math id="M22">
<mml:mi mathvariant="script">A</mml:mi>
</mml:math>
</inline-formula>
</term>
<def>
<p>action set</p>
</def>
</def-item>
<def-item>
<term>
<italic>A</italic>
</term>
<def>
<p>LSPI update matrix</p>
</def>
</def-item>
<def-item>
<term>
<italic>a</italic>
</term>
<def>
<p>action</p>
</def>
</def-item>
<def-item>
<term>
<italic>b</italic>
</term>
<def>
<p>LSPI update vector</p>
</def>
</def-item>
<def-item>
<term>
<italic>c</italic>
</term>
<def>
<p>transition count in Markov model</p>
</def>
</def-item>
<def-item>
<term>
<italic>D</italic>
</term>
<def>
<p>set of samples</p>
</def>
</def-item>
<def-item>
<term>
<italic>d</italic>
</term>
<def>
<p>transition duration</p>
</def>
</def-item>
<def-item>
<term>
<italic>e</italic>
</term>
<def>
<p>base of natural logarithm</p>
</def>
</def-item>
<def-item>
<term>
<italic>i</italic>
</term>
<def>
<p>index</p>
</def>
</def-item>
<def-item>
<term>
<italic>j</italic>
</term>
<def>
<p>module index</p>
</def>
</def-item>
<def-item>
<term>
<italic>k</italic>
</term>
<def>
<p>number of clusters / inSkill modules</p>
</def>
</def-item>
<def-item>
<term>
<inline-formula>
<mml:math id="M23">
<mml:mi mathvariant="script">M</mml:mi>
</mml:math>
</inline-formula>
</term>
<def>
<p>reinforcement learning module</p>
</def>
</def-item>
<def-item>
<term>
<italic>m</italic>
</term>
<def>
<p>Markov model value</p>
</def>
</def-item>
<def-item>
<term>
<italic>n</italic>
</term>
<def>
<p>number of samples</p>
</def>
</def-item>
<def-item>
<term>
<italic>o</italic>
</term>
<def>
<p>Markov model pruning threshold</p>
</def>
</def-item>
<def-item>
<term>
<italic>Q</italic>
</term>
<def>
<p>reinforcement learning state-action value</p>
</def>
</def-item>
<def-item>
<term>
<italic>p</italic>
</term>
<def>
<p>transition probability</p>
</def>
</def-item>
<def-item>
<term>
<italic>q</italic>
</term>
<def>
<p>long-term reward</p>
</def>
</def-item>
<def-item>
<term>
<italic>r</italic>
</term>
<def>
<p>short-term reward</p>
</def>
</def-item>
<def-item>
<term>
<italic>s</italic>
,
<italic>s</italic>′
</term>
<def>
<p>state</p>
</def>
</def-item>
<def-item>
<term>
<inline-formula>
<mml:math id="M24">
<mml:mi mathvariant="script">S</mml:mi>
</mml:math>
</inline-formula>
</term>
<def>
<p>state set</p>
</def>
</def-item>
<def-item>
<term>
<italic>t</italic>
</term>
<def>
<p>time</p>
</def>
</def-item>
<def-item>
<term>
<italic>v</italic>
</term>
<def>
<p>observed value</p>
</def>
</def-item>
<def-item>
<term>
<italic>w</italic>
</term>
<def>
<p>Markov model update weight</p>
</def>
</def-item>
<def-item>
<term>
<italic>y</italic>
</term>
<def>
<p>abstractor output</p>
</def>
</def-item>
<def-item>
<term>
<italic>z</italic>
</term>
<def>
<p>termination probability</p>
</def>
</def-item>
<def-item>
<term>Δ</term>
<def>
<p>difference operator</p>
</def>
</def-item>
<def-item>
<term>ε</term>
<def>
<p>reinforcement learning exploration rate</p>
</def>
</def-item>
<def-item>
<term>γ</term>
<def>
<p>reinforcement learning discount</p>
</def>
</def-item>
<def-item>
<term>κ</term>
<def>
<p>LSPI feature vector length</p>
</def>
</def-item>
<def-item>
<term>π</term>
<def>
<p>reinforcement learning policy</p>
</def>
</def-item>
<def-item>
<term>ɸ</term>
<def>
<p>LSPI feature vector</p>
</def>
</def-item>
<def-item>
<term>ω</term>
<def>
<p>LSPI weight</p>
</def>
</def-item>
<def-item>
<term>τ
<sub>
<italic>e</italic>
</sub>
</term>
<def>
<p>episode length</p>
</def>
</def-item>
<def-item>
<term>τ
<sub>
<italic>z</italic>
</sub>
</term>
<def>
<p>maximum module timesteps</p>
</def>
</def-item>
</def-list>
<def-list>
<title>Glossary of acronyms.</title>
<def-item>
<term>DC</term>
<def>
<p>direct current</p>
</def>
</def-item>
<def-item>
<term>DIP</term>
<def>
<p>distal interphalangeal</p>
</def>
</def-item>
<def-item>
<term>exSkill</term>
<def>
<p>externally-rewarded skill</p>
</def>
</def-item>
<def-item>
<term>inSkill</term>
<def>
<p>intrinsically-motivated skill</p>
</def>
</def-item>
<def-item>
<term>MCP</term>
<def>
<p>metacarpophalangeal</p>
</def>
</def-item>
<def-item>
<term>MEMS</term>
<def>
<p>microelectromechanical system</p>
</def>
</def-item>
<def-item>
<term>PDIP</term>
<def>
<p>combined PIP and DIP joints</p>
</def>
</def-item>
<def-item>
<term>PIP</term>
<def>
<p>proximal interphalangeal</p>
</def>
</def-item>
<def-item>
<term>RL</term>
<def>
<p>reinforcement learner</p>
</def>
</def-item>
</def-list>
</glossary>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/HapticV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001E32 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 001E32 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    HapticV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:3401897
   |texte=   Learning tactile skills through curious exploration
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:22837748" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a HapticV1 
