AlphaGo remarks
Article can be found in Nature
How I understand some parts
- We play games between the current policy network pρ and a randomly selected previous iteration of the policy network
    - The same neural network NN(params_t), after training up to time t, plays against NN(params_s) with s sampled uniformly from {1, …, t} (see the sketch after this list)
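
A minimal sketch of that opponent-sampling loop, assuming hypothetical stand-ins `play_games` and `reinforce_update` for the game engine and the policy-gradient step (these names are mine, not from the paper):

```python
import random

def play_games(current_params, opponent_params, n_games=128):
    """Play n_games between the two policies; return trajectories.

    Stub: a real implementation would drive a Go engine here.
    """
    raise NotImplementedError

def reinforce_update(params, trajectories):
    """One policy-gradient (REINFORCE) step on the collected games.

    Stub: a real implementation would compute gradients here.
    """
    raise NotImplementedError

def self_play_training(initial_params, num_iters=10_000, snapshot_every=500):
    current = initial_params
    pool = [initial_params]             # frozen past iterations params_1..t
    for t in range(1, num_iters + 1):
        opponent = random.choice(pool)  # params_s, s ~ Uniform{1, ..., t}
        trajectories = play_games(current, opponent)
        current = reinforce_update(current, trajectories)
        if t % snapshot_every == 0:
            pool.append(current)        # add current weights to the opponent pool
    return current
```

Playing against a random previous iteration, rather than always the latest one, is what keeps training stable: it prevents the policy from overfitting to a single opponent.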
 
- Self-play training is a pretty cool idea
    - Hard to apply to dialogue systems
        - there is no easy-to-compute reward signal (see the reward sketch after this list)
- Also, to simulate the other side of a conversation you would have to model user variance, which is much harder than modeling the limited set of system responses
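
What makes self-play tractable in Go is that the reward is trivially computable: the rules determine a winner at the end of the game, giving an outcome of +1 or -1. A rough sketch of that contrast; `final_state.winner` and `dialogue_reward` are hypothetical names, not from the paper:

```python
def terminal_reward(final_state, player):
    """Go's reward is objectively computable from the final board.

    `final_state.winner` is a hypothetical attribute of a game engine.
    """
    return 1.0 if final_state.winner == player else -1.0

# A dialogue analogue would need something like
#     def dialogue_reward(conversation) -> float: ...
# but there is no rules-based way to decide whether a conversation
# "succeeded", which is the core obstacle noted above.
```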
 
 