Posted on Sat 21 October 2017

AlphaGo Zero

In software, version numbers usually go up, not down. With AlphaGo Zero, we did the opposite: by taking out handcrafted human knowledge, we ended up with both a simpler, more beautiful algorithm and a stronger Go program.

We provide a full description in our paper, Mastering the game of Go without human knowledge, which you can also read online.

At the core is a self-improvement loop based on self-play and Monte Carlo Tree Search (MCTS): we start with a randomly initialized network and use it to guide the MCTS in playing the first games. The network is continuously trained on the most recently played games, and recent snapshots of the network are used to play new games.
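To make the shape of that loop concrete, here is a rough, self-contained Python sketch. Everything in it (ToyPosition, ToyNetwork, the drastically simplified search) is my own stand-in to illustrate the structure, not the actual AlphaGo Zero implementation: a randomly initialized network guides a search that plays games, and the network keeps training on the most recent games.

```python
import random
from collections import deque


class ToyPosition:
    """Stand-in for a Go position: the game ends after 10 moves, and the
    side that played more '1' moves wins (the first player wins ties)."""

    def __init__(self, moves=()):
        self.moves = tuple(moves)

    def legal_moves(self):
        return [0, 1]

    def play(self, move):
        return ToyPosition(self.moves + (move,))

    def is_terminal(self):
        return len(self.moves) == 10

    def winner(self):
        # +1 if the first player won, -1 if the second player won.
        return 1 if sum(self.moves[0::2]) >= sum(self.moves[1::2]) else -1


class ToyNetwork:
    """Stand-in for the policy/value network; it starts out knowing nothing."""

    def predict(self, position):
        return {0: 0.5, 1: 0.5}, 0.0  # uniform move priors, neutral value

    def train_step(self, batch):
        pass  # the real network would update its weights from the batch


def search(network, position, simulations=32):
    """Drastically simplified stand-in for MCTS: pick first moves from the
    network's priors, finish games randomly, and count wins per move."""
    priors, _value = network.predict(position)
    to_play = 1 if len(position.moves) % 2 == 0 else -1
    visit_counts = {move: 1 for move in position.legal_moves()}  # one initial visit each
    for _ in range(simulations):
        first_move = random.choices(list(priors), weights=list(priors.values()))[0]
        pos = position.play(first_move)
        while not pos.is_terminal():
            pos = pos.play(random.choice(pos.legal_moves()))
        visit_counts[first_move] += int(pos.winner() == to_play)
    return visit_counts


def self_play_game(network):
    """Play one game in which every move is chosen from the search statistics."""
    position, trajectory = ToyPosition(), []
    while not position.is_terminal():
        visit_counts = search(network, position)
        trajectory.append((position, visit_counts))
        position = position.play(max(visit_counts, key=visit_counts.get))
    return trajectory, position.winner()


replay_buffer = deque(maxlen=10_000)  # holds the most recent self-play data
network = ToyNetwork()                # "randomly initialized"

for _ in range(100):
    # A recent snapshot of the network plays a new game via search...
    trajectory, winner = self_play_game(network)
    replay_buffer.extend((pos, counts, winner) for pos, counts in trajectory)
    # ...and the network is continuously trained on the latest games.
    batch = random.sample(replay_buffer, k=min(64, len(replay_buffer)))
    network.train_step(batch)
```

The real system differs in every detail (the game is Go, the search is a full MCTS guided by priors and value estimates at every node, and training is distributed), but the data flow is the same.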

Of course, the initial games will consist of completely random moves (the network doesn't know anything yet). However, even these random games already teach the network how to estimate the winner, at least very close to the end of the game. The next version of the network might therefore still not know very much about how to play, but it can roughly estimate the winner in very late endgame positions, and so the search can already pick better moves just before the end of the game.
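A sketch of how one finished game becomes training data shows why even random games are useful: every position is labelled with the eventual winner, and positions just before the end are the easiest to predict from that label. This continues the toy names from the sketch above and is again my own illustration, not the paper's code.

```python
def to_training_examples(trajectory, winner):
    """Turn one finished game into (position, policy target, value target) triples."""
    examples = []
    for position, visit_counts in trajectory:
        # The policy target is the distribution of search visit counts.
        total = sum(visit_counts.values())
        policy_target = {move: count / total for move, count in visit_counts.items()}
        # The value target is the game's final outcome from the point of view of
        # the side to move: +1 if that side went on to win, -1 if it lost.
        side_to_move = 1 if len(position.moves) % 2 == 0 else -1
        value_target = winner * side_to_move
        examples.append((position, policy_target, value_target))
    return examples
```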

This process continues over and over again, with the network gaining new insights from the latest rounds of play, which in turn cause the MCTS to pick better moves, which improves the network, etc.

You can also see some cool graphics and two videos with my colleague David Silver in the DeepMind blog post.

Tags: ai, programming, go
