
Graduation thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Applied Science and Engineering: Computer Science

KNOWLEDGE TRANSFER IN DEEP REINFORCEMENT LEARNING

Arno Moonens

Academic Year: 2016-2017

Promotor: Prof. Dr. Peter Vrancx

Sciences and Bio-Engineering Sciences


Abstract

Deep reinforcement learning makes it possible to learn how to perform a task by trial and error, using high-dimensional input such as images. However, it is sometimes necessary to learn variations of a task, between which certain knowledge can be transferred. In this thesis, we learn multiple variations of a task in parallel, using both shared knowledge and task-specific knowledge. This knowledge is then transferred to a new task. In our experiments, first learning in parallel on a set of source tasks significantly improved performance on a new task, compared to not learning on source tasks or using only one. We also found that it is beneficial to start learning on a new task using the task-specific knowledge of a source task.
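
To make this setup concrete, the sketch below shows one way such a network can be organized: a hidden layer whose weights are shared by all task variations, plus a small task-specific output head per task. This is a minimal, hypothetical Python/NumPy illustration, not the implementation used in this thesis; the layer sizes, initialization, and the choice to copy an existing head for the new task are all assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, n_hidden, n_actions, n_tasks = 4, 10, 2, 3  # illustrative sizes

    # Knowledge shared across all task variations.
    W_shared = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
    # One task-specific output head per source task.
    W_heads = [rng.normal(scale=0.1, size=(n_hidden, n_actions))
               for _ in range(n_tasks)]

    def policy(state, task):
        """Action probabilities for one task: shared features -> task head."""
        hidden = np.tanh(state @ W_shared)    # shared representation
        logits = hidden @ W_heads[task]       # task-specific knowledge
        exps = np.exp(logits - logits.max())  # numerically stable softmax
        return exps / exps.sum()

    # Transfer to a new task: reuse W_shared as-is and start from a copy of an
    # existing head, one of the options the abstract suggests is beneficial.
    W_heads.append(W_heads[0].copy())
    print(policy(np.ones(n_inputs), task=n_tasks))

During parallel training, each gradient update would adjust W_shared together with the head of the task that generated the episode, so the shared weights accumulate what the task variations have in common.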



Acknowledgments

I would like to thank my promotor, Prof. Dr. Peter Vrancx, for always giving useful feedback and for helping me when I was stuck. This thesis would not have been possible without his expertise and the inspiration he offered. I would also like to thank my friends and family for always supporting me in this and all other endeavors.



Contents

1 Introduction
2 Artificial neural networks
   2.1 Basics
   2.2 Activation functions
      2.2.1 Perceptron
      2.2.2 Sigmoid
      2.2.3 Hyperbolic tangent
      2.2.4 Rectified Linear Unit
      2.2.5 Softmax
   2.3 Gradient descent and backpropagation
      2.3.1 Gradient descent
      2.3.2 Stochastic gradient descent
      2.3.3 Backpropagation
      2.3.4 Extensions and improvements
3 Reinforcement learning
   3.1 Basics
   3.2 Dynamic programming
   3.3 Monte Carlo and Temporal-Difference
   3.4 Eligibility traces
   3.5 Bootstrapping
   3.6 Policy gradient
   3.7 Generalization and function approximation
      3.7.1 Coarse coding
4 Deep learning
   4.1 Convolutional neural networks
   4.2 Recurrent neural networks
5 Deep reinforcement learning
   5.1 DQN
   5.2 Continuous control with deep reinforcement learning
   5.3 Asynchronous Methods for Deep Reinforcement Learning
6 Transfer learning
   6.1 Transfer learning dimensions
   6.2 Metrics
   6.3 Related work
7 Proposed algorithm
8 Experimental setup
   8.1 Cart-pole environment
   8.2 Acrobot environment
   8.3 Methodology
   8.4 Results
      8.4.1 Parallel and sequential knowledge transfer
      8.4.2 Feature extraction
      8.4.3 Usage of a different number of source tasks
      8.4.4 Transfer of sparse representation
      8.4.5 REINFORCE using a source and target task
9 Conclusion
Appendices
A Experiment details
   A.1 Data collection
   A.2 Artificial neural network parameters
   A.3 Environment parameters

List of Figures

2.1 An artificial neural network
2.2 The perceptron unit
2.3 The sigmoid function
2.4 Hyperbolic tangent function
2.5 Rectified Linear Unit function
2.6 Leaky ReLU function
2.7 An example of an error surface
3.1 n-step returns
3.2 TD forward view
3.3 TD backward view
3.4 Generalization between points
3.5 Shapes of receptive fields
4.1 An example of a 3-by-3 filter applied to an image
4.2 Convolutional neural network layout
4.3 Convolutional neural network feature hierarchy
4.4 Recurrent neural network
4.5 Long short-term memory cell
4.6 Gated Recurrent Unit cell
6.1 Transfer learning metrics
7.1 Artificial neural network architecture used in our approach
8.1 Visualization of the cart-pole environment
8.2 Visualization of the acrobot environment
8.3 Learning curves for the parallel and sequential versions of our algorithm applied to the cart-pole environment
8.4 Learning curves for the transfer learning algorithm applied to the cart-pole environment, with no feature extraction in its neural network, with a layer of 5 units, and with a layer of 10 units
8.5 Learning curves of REINFORCE and TLA for the cart-pole environment
8.6 Learning curves of REINFORCE and TLA for the acrobot environment
8.7 Boxplots of the asymptotic performance of REINFORCE, TLA 5, and TLA 10 in the acrobot environment
8.8 Learning curves of TLA with and without sparse representation transfer in the cart-pole environment
8.9 Learning curves of TLA with and without sparse representation transfer in the acrobot environment
8.10 REINFORCE applied for 100 epochs to a randomly chosen source task of a cart-pole environment and afterwards to the target task, using the same network and weight values; the learning curve is compared to that of the TLA 5 algorithm with sparse representation transfer
8.11 REINFORCE applied for 100 epochs to a randomly chosen source task of an acrobot environment and afterwards to the target task, using the same network and weight values

List of Algorithms

1 Backpropagation
2 Sarsa(λ)
3 Watkins’ Q(λ)
4 REINFORCE
5 QAC