What to expect from AlphaZero's value predictions [D]

An agent trained with AlphaZero learns to predict the value of a game state from data generated through self-play by the current model and its predecessors. By construction, this value is supposed to reflect the probability of winning against a copy of itself from the given state. More precisely, the value measures the agent's average strength against opponents drawn from the predecessors of the current model. This average depends on how the training data is sampled (a rolling window over self-play games from the latest x models, with geometric weighting that favors more recent models).
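To make the sampling scheme concrete, here is a minimal sketch of how a rolling window with geometric weighting turns per-generation results into a single value target. The function names and the parameters `window` and `decay` are my own assumptions for illustration and do not come from any particular AlphaZero implementation.

```python
def window_weights(num_models, window, decay):
    """Sampling probability per model generation (index 0 = oldest).

    Assumption: only the latest `window` generations are sampled, and the
    weight shrinks by a factor `decay` for each generation back in time.
    """
    start = max(0, num_models - window)
    raw = [0.0] * num_models
    for generation in range(start, num_models):
        age = (num_models - 1) - generation  # 0 for the newest model
        raw[generation] = decay ** age
    total = sum(raw)
    return [w / total for w in raw]


def expected_value_target(outcomes_by_generation, weights):
    """Weighted mean of per-generation average outcomes in [-1, 1]."""
    return sum(w * o for w, o in zip(weights, outcomes_by_generation))


# Hypothetical numbers: six generations, of which the last four are sampled.
weights = window_weights(num_models=6, window=4, decay=0.5)

# Hypothetical average outcomes of one position against each generation.
outcomes = [1.0, 1.0, 0.8, 0.5, 0.2, 0.0]
target = expected_value_target(outcomes, weights)
```

In this toy setup the two oldest generations get zero weight and the newest one gets more than half of the total, so the target sits much closer to the result against the latest model than a plain average over all predecessors would.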

In each round of self-play, one can think of the agents as following strategies defined by the PUCT function applied to the predicted values and policies, with these strategies slightly perturbed by mixing in some proportion of Dirichlet noise. The purpose of this noise is to let the model discover successful actions by chance rather than getting stuck in rigid patterns.
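As a rough illustration of that perturbation, the sketch below mixes Dirichlet noise into the prior policy and then scores actions with the usual PUCT form, Q(a) + c_puct * P(a) * sqrt(sum of visits) / (1 + N(a)). The helper names and the values chosen for `epsilon`, `alpha` and `c_puct` are assumptions for illustration, not settings taken from a specific implementation.

```python
import math
import random


def dirichlet(alpha, n, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via gamma variates."""
    samples = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(samples)
    if total == 0.0:  # guard against numerical underflow
        return [1.0 / n] * n
    return [s / total for s in samples]


def noisy_priors(priors, epsilon, alpha, rng):
    """Mix the predicted policy with Dirichlet noise: (1 - eps) * P + eps * noise."""
    noise = dirichlet(alpha, len(priors), rng)
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


def puct_scores(q_values, priors, visit_counts, c_puct):
    """PUCT score per action: Q + c_puct * P * sqrt(total visits) / (1 + N)."""
    sqrt_total = math.sqrt(sum(visit_counts))
    return [
        q + c_puct * p * sqrt_total / (1 + n)
        for q, p, n in zip(q_values, priors, visit_counts)
    ]


def select_action(scores):
    """Index of the highest-scoring action."""
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
priors = [0.7, 0.2, 0.1]  # hypothetical policy output at the root
mixed = noisy_priors(priors, epsilon=0.25, alpha=0.3, rng=rng)

# With equal Q and equal priors, the unvisited action wins the exploration bonus.
scores = puct_scores([0.0, 0.0], [0.5, 0.5], [3, 0], c_puct=1.0)
chosen = select_action(scores)
```

Because the noise only replaces a fraction `epsilon` of the prior, every action keeps at least `(1 - epsilon)` of its predicted probability, which is one way to see why rare "outlier" moves stay rare in the generated data.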

Given that the data used for value predictions includes “outlier” moves, one could argue that AlphaZero bases its predictions on experience playing against a variety of different opponents. However, because outlier moves have little impact on value predictions, these predictions are governed mainly by the agent’s own playing style and historical development.

If an AlphaZero agent encounters a strong opponent (either a human or another algorithm with a robust track record), why should we expect its value prediction to be a reliable measure of the agent’s chances of winning from that position?

Experience has shown AlphaZero to outperform both human players and other algorithms in various games. Could this success have been expected beforehand, or is it conceivable that AlphaZero could fail miserably against a specific algorithm whose moves appear infrequently in its training data?

Key Takeaways

The value prediction reflects the agent's average strength against opponents drawn from the predecessors of the current model.
This average is weighted geometrically, so self-play data from more recent models counts more than data from older ones.
Outlier moves have a minimal impact on the value predictions, as they are considered less significant in determining the overall strategy and style of play.
Whether AlphaZero’s value prediction is reliable against an unfamiliar opponent depends on how well it generalizes beyond the self-play data it was trained on.

These points highlight the nuances and limitations of using AlphaZero’s value predictions as a reliable measure in different scenarios.
