AlphaZero

This was an old implementation of OpenSpiel based on TF1. There is another other based on C++ LibTorch. This document covers mostly the old TF-based implementation and common components. For the Libtorch-based implementation, see here.

Disclaimer: this is not the code that was used for the Go challenge matches or the AlphaZero paper results. It is a re-implementation for illustrative purposes, and although it can handle games like Connect Four, it is not designed to scale to superhuman performance in Go or Chess.

Background

AlphaZero is an algorithm for training an agent to play perfect information games from pure self-play. It uses Monte Carlo Tree Search (MCTS) with the prior and value given by a neural network to generate training data for that neural network.

Links to relevant articles/papers:

AlphaGo Zero: Starting from scratch has an open access link to the AlphaGo Zero nature paper that describes the model in detail.
AlphaZero: Shedding new light on chess, shogi, and Go has an open access link to the AlphaZero science paper that describes the training regime and generalizes to more games.

Overview:

The Python and C++ implementations are conceptually fairly similar, and have roughly the same components: actors that generate data through self-play using MCTS with an evaluator that uses a neural network, a learner that updates the network based on those games, and evaluators playing vs standard MCTS to gauge progress. Both write checkpoints that can be played independently of the training setup, and logs that can be analyzed programmatically.

The Python implementation uses one process per actor/evaluator, doesn’t support batching for inference and does all inference and training on the cpu. The C++ implementation, by contrast, uses threads, a shared cache, supports batched inference, and can do both inference and training on GPUs. As such the C++ implementation can take advantage of additional hardware and can train significantly faster.

Model

The model defined in open_spiel/python/algorithms/alpha_zero/model.py is used by both the python and C++ implementations.

The model defines three architectures in decreasing complexity:

resnet: same as the AlphaGo/AlphaZero paper when set with width 256 and depth 20.
conv2d: same as the resnet except uses a conv+batchnorm+relu instead of the residual blocks.
mlp: same as conv2d except uses dense layers instead of conv, and drops batch norm.

The model is parameterized by the size of the observations and number of actions for the game you specify, so can play any 2-player game. The conv2d and resnet models are restricted to games with a 2d representation (ie a 3d observation tensor).

The models are all parameterized with a width and depth:

The depth is the number of blocks in the torso, where the definition of a block varies by model. For a resnet it’s a resblock which is two conv2ds, batch norms and relus, and an addition. For conv2d it’s a conv2d, a batch norm and a relu. For mlp it’s a dense plus relu.
The width is the number of filters for any conv2d and the number of hidden units for any dense layer.

The networks all give two outputs: a value and a policy, which are used by the MCTS evaluator.

MCTS

Monte Carlo Tree Search (MCTS) is a general search algorithm used to play many games, but first found success playing Go back in ~2005. It builds a tree directed by random rollouts, and does usually uses UCT to direct the exploration/exploitation tradeoff. For our use case we replace random rollouts with a value network. Instead of a uniform prior we use a policy network. Instead of UCT we use PUCT.

We have implementations of MCTS in C++ and python.

MCTS Evaluator

Both MCTS implementations above have a configurable evaluator that returns the value and prior policy of a given node. For standard MCTS the value is given by random rollouts, and the prior policy is uniform. For AlphaZero the value and prior are given by a neural network evaluation. The AlphaZero evaluator takes a model, so can be used during training or with a trained checkpoint for play with open_spiel/python/examples/mcts.py.

Actors

The main script launches a set of actor processes (Python) or threads (C++). The actors create two MCTS instances with a shared evaluator and model, and play self-play games, passing the trajectories to the learner via a queue. The more actors the faster it can generate training data, assuming you have sufficient compute to actually run them. Too many actors for your hardware will mean longer for individual games to finish and therefore your data could be more out of date with respect to the up to date checkpoint/weights.

Learner

The learner pulls trajectories from the actors and stores them in a fixed size FIFO replay buffer. Once the replay buffer has enough new data, it does an update step sampling from the replay buffer. It then saves a checkpoint and updates all the actor’s models. It also updates a learner.jsonl file with some stats.

Evaluators

The main script also launches a set of evaluator processes/threads. They continually play games against a standard MCTS+Solver to give an idea of how training is progressing. The MCTS opponents can be scaled in strength based on the number of simulations they are given per move, so more levels means stronger but slower opponents.

Output

When running the algorithm a directory must be specified and all output goes there.

Due to the parallel nature of the algorithm writing logs to stdout/stderr isn’t very useful, so each actor/learner/evaluator writes its own log file to the configured directory.

Checkpoints are written after every update step, mostly overwriting the latest one at checkpoint--1 but every checkpoint_freq is saved at checkpoint-<step>.

The config file is written to config.json, to make the experiment more repeatable.

The learner also writes machine readable logs in the jsonlines format to learner.jsonl, which can be read with the analysis library.

Usage:

Python

The code lives at open_spiel/python/algorithms/alpha_zero/.

The simplest example trains a tic_tac_toe agent for a set number of training steps:

python3 open_spiel/python/examples/tic_tac_toe_alpha_zero.py

Alternatively you can train on an arbitrary game with many more options:

python3 open_spiel/python/examples/alpha_zero.py --game connect_four --nn_model mlp --actors 10

Analysis

There’s an analysis library at open_spiel/python/algorithms/alpha_zero/analysis.py which reads the config.json and learner.jsonl from an experiment (either python or C++), and graphs losses, value accuracy, evaluation results, actor speed, game lengths, etc. It should be reasonable to turn this into a colab.

Playing vs checkpoints

The checkpoints are compatible between python and C++, and can be loaded by the model. You can try playing against one directly with open_spiel/python/examples/mcts.py:

python3 open_spiel/python/examples/mcts.py --game=tic_tac_toe --player1=human --player2=az --az_path <path to your checkpoint directory>