Data mining - How can I improve my TensorFlow 2.0 policy gradient code?

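I recreated some code I found online that solves the bandit problem with policy gradients. The example was written for TensorFlow 1.0, so I reimplemented it in TensorFlow 2.0 using eager execution and GradientTape. When training the model, however, I have to convert the weights tensor to a NumPy array, update the weights, and then reassign the tf.Variable from the NumPy array. This does not seem performant, and I suspect there is a better way. The full code is here: https://github.com/entrpn/reinforcement-learning/blob/master/tf2_rl/bandits.py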

The main code I would like to improve is:

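import numpy as np
import tensorflow as tf

def train(agent, action, reward, learning_rate=0.001):
    # Record the forward pass so the loss can be differentiated.
    with tf.GradientTape() as t:
        current_loss = loss(agent(action), reward)
    dW = t.gradient(current_loss, [agent.weights])
    # Copy the weights out of TensorFlow into a NumPy array.
    weights_as_np = agent.weights.numpy()
    responsible_weight = agent.weights[action]  # not used below
    responsible_weight_dw = np.array(dW)[0][action]

    # Gradient descent step on the weight of the chosen action only.
    weights_as_np[action] = weights_as_np[action] - learning_rate * responsible_weight_dw

    # Copy the updated array back into the variable.
    agent.weights.assign(tf.Variable(weights_as_np))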
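
One possible way to avoid the NumPy round trip is to apply the gradient to the variable directly with `assign_sub`. The sketch below is a minimal illustration, not the code from the linked repository: the `Agent` class and the `loss` function are hypothetical stand-ins, assumed here to return the weight of the chosen arm and `-log(weight) * reward` respectively. Under that assumption only `weights[action]` enters the loss, so every other gradient entry is zero and a plain dense update changes the same single weight as the manual NumPy version.

```python
import tensorflow as tf


class Agent:
    """Hypothetical stand-in for the question's agent: one weight per bandit arm."""

    def __init__(self, num_arms):
        self.weights = tf.Variable(tf.ones([num_arms]))

    def __call__(self, action):
        # Return the weight of the chosen arm.
        return self.weights[action]


def loss(responsible_weight, reward):
    # Assumed policy gradient loss: -log(weight) * reward.
    return -tf.math.log(responsible_weight) * reward


def train(agent, action, reward, learning_rate=0.001):
    with tf.GradientTape() as tape:
        current_loss = loss(agent(action), reward)
    grad = tape.gradient(current_loss, agent.weights)
    # Make sure the gradient is a dense tensor (it may be an IndexedSlices
    # object if the forward pass uses tf.gather instead of indexing).
    grad = tf.convert_to_tensor(grad)
    # Update the variable in place; entries for other arms have zero gradient.
    agent.weights.assign_sub(learning_rate * grad)
    return current_loss


agent = Agent(num_arms=4)
train(agent, action=2, reward=1.0, learning_rate=0.1)
updated = agent.weights.numpy()
```

If the real `loss` involves more than the selected weight, the gradient will be nonzero for the other arms too, and this dense update would no longer match the single-weight update in the question.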