Ondřej Plátek Blog
PhD candidate@UFAL, Prague. LLM & TTS evaluation. Engineer. Researcher. Speaker. Father.

AlphaGo remarks

The article can be found in Nature.

How I understand some parts

  • We play games between the current policy network p_ρ and a randomly selected previous iteration of the policy network
    • That is, the network after training up to time t, NN(params_t), plays against NN(params_{sample(1…t)}), the same network with parameters from a randomly sampled earlier step
  • Self-training is a pretty cool idea
    • Hard to apply to dialogue, though
      • there is no easy-to-compute reward
      • modeling the variance in user behavior is much harder than modeling the limited set of system responses
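
To make the first bullet concrete, here is a minimal sketch of the opponent sampling as I understand it. It is not the paper's implementation: `play_game`, `update`, and `self_play_training` are hypothetical names, and saving a snapshot after every iteration is my simplifying assumption.

```python
import random


def sample_opponent(snapshots, rng):
    """Pick one of the earlier parameter snapshots uniformly at random."""
    return rng.choice(snapshots)


def self_play_training(initial_params, play_game, update, iterations, seed=0):
    """Train by playing the current policy against sampled past versions.

    play_game(current, opponent) -> outcome from the current policy's view
    update(current, outcome)     -> new parameters
    Both callables are hypothetical placeholders supplied by the caller.
    """
    rng = random.Random(seed)
    snapshots = [initial_params]  # pool of previous iterations
    current = initial_params
    for _ in range(iterations):
        opponent = sample_opponent(snapshots, rng)
        outcome = play_game(current, opponent)
        current = update(current, outcome)
        # Assumption: snapshot every iteration to keep the sketch short.
        snapshots.append(current)
    return current, snapshots


if __name__ == "__main__":
    # Toy stand-ins: the "parameters" are one number, the larger number wins.
    toy_play = lambda cur, opp: 1 if cur >= opp else -1
    toy_update = lambda params, outcome: params + 0.1 * outcome
    final, pool = self_play_training(0.0, toy_play, toy_update, iterations=10)
    print(final, len(pool))
```

Sampling from the whole pool instead of always playing the latest version means the current policy has to stay good against many past versions of itself, not only the most recent one.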