From TD-Gammon to Race for the Galaxy
Temporal Difference Learning for Boardgame AI
What makes a game replayable over time? It offers new challenges over and over again. One way to do that is to include an AI opponent that is so skilled, even advanced players will continue to be challenged after hundreds of hours of play. Race has been one of the top selling boardgames this year partly because of the neural network that powers its AI. Race for the Galaxy uses a temporal difference neural network. This knowledge-free system requires no human input to generate training data, which makes it efficient for a small team with limited resources. Instead, it learns by playing randomly, making predictions at the turn level on which player is winning, and updating the weights in its multilayer perceptron architecture such that the delta between predictions from one turn to the next is diminished. Using this method on over 30,000 training games, it’s learned the black box function that best represents the relationship between input (the state of the game) and output (prediction of who’s winning) for the neural network that drives our AI. It’s a pretty incredible method, first used for boardgames by Gerald Tesauro who created TD Gammon and now applied to Race for the Galaxy by our AI expert, Keldon Jones.
Heuristic AI At Temple Gates, we started making boardgame AIs with Ascension. In this game, we used heuristic driven AI. This means the AI favors particular moves that we deem superior. For example it buys cards with a high VP to cost ratio or it trashes low value militias before other cards. These heuristics are defined by human experts. The AI moves are based on what we think are the best plays. And it’s pretty good. But an expert player will eventually win frequently against this AI.
Keldon Client As any devoted boardgame junkie would, sometimes I Google my favorite tabletop games to see if anyone has digitized them. A few years ago I Googled my favorite game, Race, to see if a digital version existed and I found one! It was an unlicensed project developed by Keldon Jones. I clocked hundreds of games in it, and wanted just one feature: a sound to alert the player that it’s their turn. Fortunately it was open source, so I could start tinkering under the hood, and there was some neat stuff happening there. This client had a world class AI. I thought about all the other features I wanted and feature creep city, eventually I wanted it on my phone.
Licensing Back to Google, I looked up Race’s designer and got in touch with Tom Lehmann to see if that was something I could make happen. Tom had one request, though. He wanted to make sure Keldon could be involved in the project. Apparently he had a pretty neat AI!
TD Gammon This AI was based on the original research for TD Gammon by Gerald Tesauro. Yep, one of the most cutting edge game AIs ever created is based on a 5000 year old game: Backgammon. Tesauro pioneered a neural network driven AI that ultimately changed conventional wisdom on how Backgammon should be played. This AI successfully developed on its own unorthodox strategies that were eventually adopted at the highest tournament levels and are now widely accepted by players around the world. For example slotting vs. splitting. Traditionally, players whose opening roll includes a 1, use the 1 to move from position 6 to position 5. This is a risky move with a high reward, if the opponent doesn’t bounce you out, they’ll have a much harder time escaping the 1 position later. But TD Gammon, based on a neural network, found higher success rates over thousands of games by “splitting” rather than “slotting”, and moving one checker off of 24 to 23.
It doesn’t exactly matter why it’s a better strategy. It simply has a higher win rate. Both the predictions and the outcomes favored splitting.
How did this AI discover the superiority of a strategy that humans hadn’t in 5000 years? The AI starts with zero initial knowledge. So unlike our heuristic driven AI in Ascension, it doesn’t come pre-populated with imperatives based on human expertise, in essence, it’s teacherless. And that’s good, because human expertise can be fallible..
Formula TD Gammon uses a neural network to determine its moves. It yielded a formula for changing the NN weights every turn, to reduce the difference between the current and previous turn’s board position. This is temporal difference learning.
is the amount to change a weight from it’s value on the previous turn. is a learning parameter. is the difference between the current and previous turn board evaluations. is the rate of decay in back propagating older estimates and is how much changing the weight affects the output. We use a similar formula in Race for the Galaxy.
Training Data? Neural networks typically get better by adjusting their edge weights so that the computations result in the inputs and outputs better matching the training data. Training data usually comes from feeding expert data into the neural network.
If you want a neural network to be able to identify photos of cats, first you feed it a bunch of photos that a human says are cats. This is expensive. It requires human labor. Keldon’s AI does something different, and this is based on that teacherless system pioneered with TD gammon.
Reinforcement Learning So we integrated it. We use what’s called a “knowledge free” system. Where do we get our training data? The neural network actually generates it by playing itself. The AI is fed input, produces output, and receives a reward based on feedback signal. The feedback is whether the move resulted in a winning game board, but how do you get feedback from turn to turn, when you don’t yet know if the game was won? The reward at the end of the game is delayed. The solution: A temporary credit or blame is assigned at each turn leading up to the final reward at game end. This is the crux of the Temporal Difference method.
Temporal Credit Assignment This style of reinforcement learning is based on temporally successive predictions. What I mean by that is, turn n it trains turn n-1. And this back propagates, with decay. So if it’s making a prediction of player 3 winning on turn 2, but then on turn 3 it makes a prediction of player 5 winning, it trains itself that on that state on turn two, it should have skewed toward player 5 – it adjusts its weights. At the end of game, rather than using prediction, it does use actual winner as training data, but It’s training every time step or turn it gets run. Each opponent effectively pushes the NN. 30,000 games x 4 players x # turns is about a million Time Steps its trained on.
Escape Bias Using this knowledge free system frees us from relying on a human teacher. The AI only needs to know the rules of the game. This is interesting for two reasons. 1) It keeps our costs very low. We can get a ton of training data with virtually no human expense. 2) It frees us from human bias. This has been a big controversy for AI’s recently.
Tactics Crystallize What’s happening here is that the initial strategy is random. During first thousand training games, tactics emerge such as Piggy Backing. Keldon’s AI have developed Piggy Backing strategies, completely independent of human input.
For example, here the AI chose Trade, even though it has no goods to trade. I fell into the trap, chose settle, the AI can settle a world that comes with a good, effectively piggy backing my settle, and trade it. The tactics employed can be nuanced and context sensitive, unlike a heuristic driven AI.
Inputs In our neural network, what are the inputs? These are 800 nodes including a ton of game state data.
AI Bifurcations In fact these inputs don’t just feed into one neural network. Keldon bifurcates the model. One of the most important things is deciding how to architect your bifurcations. We have twelve unique models of neural networks each trained for a different set of expansions and player count. If you’re running a two player game, the AI is on a different network than a three player game. We could always partition further, but there are diminishing returns and you’re burdened with complexity and size – which is bad. The NNs are one quarter our download size, which is pretty nutso given the volume of art we ship with from all the game cards.
And for each bifurcation there are actually two flavors of neural networks at work, each with it’s own main function. So really we ship with 24 neural specialized neural networks.
Outputs These functions determine the outputs. The functions are ‘What move would player X make in state Y?’ We call that function Eval_Role. The second function is given this board, score it: tell me how good it would be for me on a scale of -1 to 1. We call this Eval_Board.
Simulating Forward On its turn the AI simulates through every possible move it could make, and it runs the function called Eval_Board on the results and chooses the best one. In fact, to move the simulation forward a step could involve one or more calls to Eval_Role to guess opponents role choices.
Tuning Difficulty The result of all this is that the AI is really hard. Or at least it can be. You might think a hard AI is not ideal for some people, maybe new players. And you’re right! It’s very difficult to make an AI harder, but you can nerf an AI to be easier pretty simply. In our neural network outputs, we add noise to the score. Increasing the noise will make the AI choose the 2nd or 3rd best options. You can tune the amount of noise you add to find the right difficulty settings. When determining the noise for a medium AI, you want it to win roughly 25% of games against a hard AI.
TD NN Candidate Checklist The main thing is, you can get an AI that can be very skilled, which means can satisfy and challenge players for hundreds of hours. And that’s good! But not all games are suitable for temporal difference learning AI. I’ll leave you with a little checklist of the kind of games that will benefit from a temporal difference NN.
- Termination The game needs a definitive end, so that the final reward signal can propagate back to the previous turns. Chess is a bad candidate because it can result in a stalemate dance between moves, and never generate this final reward signal.
- High stochasticity A broad possibility space opened up by branching probabilities gives this type of NN space to discover novel strategies. Lots of RNG is great, so deck shuffling and dice rolling work well.
- Non-spatial Your NN is trying to learn a function that it doesn’t know yet. A function is easier to learn when it’s smooth. A small change to the input, should result in a small change to the output. If you have a chess arrangement and make a small change to one piece’s position, that could result in a wild change to the probability of winning. For this reason, games where relative arrangements are critical are bad candidates for TD NNs.
- Fixed number inputs Your AI learns best when the inputs are fixed. This is why we branch our NNs when we add a new expansion. The AI is most efficacious when the incoming variables are the same every time it plays.
- Multiple turns While it’s theoretically possible to make a TD NN to help with a co-op game, you wouldn’t want to choose a game like One Night Werewolf. That’s because Werewolf resolves the entire game in one turn, so there could be no backward decaying reward for the nn to learn from. Additionally, the acumen on a TD NN drops off toward the end of the game because it’s benefiting from fewer forward rewards.
Up Next? We’re investigating adding a NN driven tree search to our Roll for the Galaxy AI for better performance, based on the recent publication about AlphaGo Zero that extends the NN technique to handle deep search trees. Right now with Race we do a full 2 ply lookahead, but the branching factor of Roll seems higher and might be too much for phones. We’ll see.
For games like Race, it’s often the case that simple mechanics produce large search spaces as well. The most recent expansion for Race, Brink of War, occasionally results in game states where over 500,000 neural networks evaluations are needed to advance the game a single turn. To overcome this, we currently heuristically discard low probability branches of the search tree. In the future we could use the techniques from AlphaGo Zero to prune those evaluations and improve performance, especially on platforms like phones and tablets.