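I re-created some code I found online that solves the bandit problem using policy gradients. The example was written for TensorFlow 1.0, so I re-created it in TensorFlow 2.0 using eager execution and GradientTape. However, when training the model, I have to convert the weights Tensor to a NumPy array, update the weights, and then reassign the tf.Variable from the NumPy array. I feel this is not performant and that there must be a better way. The full code is here: https://github.com/entrpn/reinforcement-learning/blob/master/tf2_rl/bandits.py
The main code I hope to improve is as follows: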
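    import numpy as np
    import tensorflow as tf

    def train(agent, action, reward, learning_rate=0.001):
        # Record the forward pass so the loss can be differentiated.
        with tf.GradientTape() as t:
            current_loss = loss(agent(action), reward)
        dW = t.gradient(current_loss, [agent.weights])
        # Round trip through NumPy: this is the part I want to avoid.
        weights_as_np = agent.weights.numpy()
        responsible_weight = agent.weights[action]  # not used below
        responsible_weight_dw = np.array(dW)[0][action]
        # Gradient-descent step on the weight of the chosen action only.
        weights_as_np[action] = weights_as_np[action] - learning_rate * responsible_weight_dw
        agent.weights.assign(tf.Variable(weights_as_np))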
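
For comparison, here is a minimal sketch of the same update done directly on the variable, with no NumPy round trip. It rests on assumptions that are not confirmed by the snippet above: that the weights are a 1-D `tf.Variable` with one entry per arm, that the agent simply selects `weights[action]`, and that the loss is `-log(weight) * reward`. Under those assumptions the gradient is zero for every arm except the chosen one, so subtracting the whole scaled gradient with `assign_sub` changes only that entry.

```python
import tensorflow as tf

# Hypothetical stand-in for agent.weights: one weight per bandit arm.
weights = tf.Variable(tf.ones([4]))

def loss(responsible_weight, reward):
    # Assumed policy-gradient loss; the original loss function is not shown.
    return -(tf.math.log(responsible_weight) * reward)

def train(weights, action, reward, learning_rate=0.001):
    # Record the forward pass; trainable variables are watched automatically.
    with tf.GradientTape() as tape:
        current_loss = loss(weights[action], reward)
    grad = tape.gradient(current_loss, weights)
    # Densify in case the gradient comes back as a sparse IndexedSlices.
    grad = tf.convert_to_tensor(grad)
    # In-place update on the variable; arms that were not chosen have
    # zero gradient here, so only weights[action] changes.
    weights.assign_sub(learning_rate * grad)
    return current_loss
```

Calling `train(weights, action, reward)` in the training loop then updates the variable in place. If the real loss also depends on other weights, this is no longer equivalent to updating a single entry, and the gradient would need to be masked first.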