AlphaGo remarks
The article can be found in Nature
How I understand some parts:
- From the paper: "We play games between the current policy network p_ρ and a randomly selected previous iteration of the policy network"
- In other words: the same neural network NN(params_t), after training up to time t, plays against NN(params_{sample(1…t)}) (see the sketch after this list)
- Self-play training is a pretty cool idea
- Hard to apply it to dialogues:
	- no easy-to-compute reward
	- user variance is much harder to model than a limited set of system responses
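A minimal sketch of how I picture the opponent-pool self-play loop. This is not the paper's code: `play_games`, `policy_gradient_update`, and `copy` are hypothetical placeholders, and only the overall shape (sample a past checkpoint, play, update, periodically freeze the current parameters into the pool) follows the paper's description.

```python
import random

def self_play_training(policy, play_games, policy_gradient_update,
                       n_iters, checkpoint_every=500):
    """Train `policy` by playing against randomly sampled past iterations.

    All callables are hypothetical placeholders:
      - policy.copy()              -> frozen snapshot of current parameters
      - play_games(p, opp)        -> game trajectories with outcomes z
      - policy_gradient_update(p, games) -> REINFORCE-style update on z
    """
    opponent_pool = [policy.copy()]  # pool of previous iterations
    for step in range(1, n_iters + 1):
        # NN(params_t) plays against NN(params_{sample(1…t)})
        opponent = random.choice(opponent_pool)
        games = play_games(policy, opponent)
        policy_gradient_update(policy, games)
        if step % checkpoint_every == 0:
            # freeze current parameters into the opponent pool
            opponent_pool.append(policy.copy())
    return policy
```

Randomizing over a pool of past opponents, rather than always playing the latest network, is what the paper says stabilizes training by preventing overfitting to the current policy.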