An application to the Connect 4 game.
Why connect 4?
When I decided to learn about Reinforcement Learning, I thought that I could begin with Generals.io. After all, it seemed quite easy to build unbeatable AI in Chess, Go and DOTA II. How could I fail to do the same?
Being a noob at both Neural Networks and Reinforcement Learning, I finally decided to start small: connect 4. First thing I was told?
Why connect 4? You can easily brute force it!
True. You can find a complete guide here. But that’s not the point. I wanted to try, fail and eventually succeed in building a learning AI.
What is the famous Q-learning?
Q-learning is the first algorithm given back by Google when looking for reinforcement learning. I won’t explain the whole concept, you can find a great introduction here. I’ll just give a quick summary here.
Q-learning is a solution to a MDP, aka Markov Decision Process. A MDP is defined by an agent, performing in an environment by means of actions. The environment can be in different states.
Given those 3, the idea of Q-learning is to reward the agent if the action it took led him to a better state. Or punish it if not. Those rewards will be used to build the function Q which is an estimate of the final gain, given the state s and the action a.
To ensure that a play is considering the potential future outcome of a game, the factor gamma is used to leverage the maximum Q-value of the next state s+1 among the possible actions.
Temporal Difference Learning
In connect 4, it is not possible to know if an action was a good move or a bad one. You have to wait until the end of the game to know if the succession of actions you took led you to the victory. It is the idea behind Temporal Difference Learning (TD-learning): the agent gets a reward when it wins the game, this reward is then distributed to the previous actions, with a discount.
With Q-learning and TD-Learning combined, each time the agent is rewarded — or punished — the function Q is updated with the following algorithm:
After several iterations, you learn the function Q (got it?).
We need to go deeper!
For simple games (a few actions, a few more states), you can evaluate the exact function Q as a 2D array. For more complex games, you have to approach the function Q. This is where neural networks come into play. They can be flexible enough for this task.
The neural network will take the state of the environment as an input and will output an array corresponding to the available actions.
Your mission is to train the neural network to approach the Q function that will make the best choices.
The question is now: how to put this into use for connect 4?
Markov Decision Process
For connect 4, the environment is the board. The states are the different possibilities of this board being filled with yellow and red chips. There are at most 7 actions: drop a chip in each of the 7 columns.
What is my “labeled” data?
I don’t have a dataset of games played by humans. The idea of an AI learning just by playing against itself was kind of sexy so I decided to go for it: the training AI is playing successively as player 1 and player 2, saving the board states and actions. The more it learns, the more interesting the games will be.
To be sure the final model has explored enough in the game, there has to be a big part of randomness: I implemented the classic ε-greedy exploration strategy.
I first decided to give a reward of +100 to the agent for winning the game. I trained my network and then played against it. This is when I realized it was not trained to prevent its opponent from winning, just to win when it could. For most cases, when a player wins, either the opponent could have prevented it, either it allowed the winner to win. For the former, I give the agent a +75 reward. For the latter, a punishment, that is to say a -75 reward.
Those rewards are completely arbitrary, I haven’t found a general law for picking them.
- A dense neural network with SynapticJS (connectd to the 42 slots of the board).
- A convolutional neural network (CNN) with ConvNetJS.
I had the intuition that a CNN would be quicker to train and smarter, for this reason: consider the following situation representing 2 boards for 2 different games.
Playing as the yellow player, a person would have the same reaction, regardless of the board: win. On the opposite, a dense neural network would see 2 completely different situations and would need to learn to play the same way.
I had the sense that a convolutional neural network would detect the same pattern and activate the same way the right column. No need to learn two times the same information.
My repository is available on GitHub. I created a few APIs in order to:
- Interface with the neural networks whichever the chosen implementation: NeuralNetwork;
- Play a whole game of connect4: Game;
- Train the network using Q-Learning: QLearning (see file below);
They are completed with a Helper file and an index.js file to launch the training:
To reach the best level, I trained the same network several batches of around 200k games. Globally, I started with a lot of exploration (9/10 of the plays are random) to reach an even ratio at the end of the training — around 800k games played. On every batch, the learning rate was decreasing from 1e-5 to 1e-7.
The training is executed with node on a single CPU and takes at most 1 hour per batch for the dense neural network and a few hours for the CNN.
I like to say that the trained network has reached the level of a young teenager that has played quite often.
How did I measure the convergence?
To be sure my model was correctly learning, I decided to regularly use the model to predict its next play on a board where it could obviously win. I used 3 situations where it could win: horizontally, vertically and diagonally. Here is a chart displaying the typical convergence I observed.
Try it yourself!
I learnt a lot doing this side-project: experimenting things instead of overthinking them; being patient during a training; dividing my work to progress a bit everyday.
I didn’t meet the results I hoped with the CNN. It was quite frustrating: both networks have more or less the same level, and the CNN training takes way more time. I now need to look deeper into convolutions and understand how to improve them.
What I haven’t experienced
There are a few things that I didn’t try on this project. If I had more time, I would explore one of them. If you are interested, here is the list:
- Benefit from the symmetry of the board (both playing and learning)
- Train my AI against a brute force/exploratory model (Monte-Carlo, minimax)
- Train my models against the first ones (for faster training)
- Conduct a hyper-parameters optimization
- Define a true performance measure of my AI “smartness”
- For the CNN: Analyze the network to follow activated neurons and see the convolution filter that activated it.
- For the CNN: Change the parameters of the filters (padding, size, stride)
Try not to lose!
Make your own mind playing against my connect4 AI on Sicara’s website! You can see its limitations but hey, it’s learning live! And it’s my first baby!
Thanks to Damien Peltier and Adil Baaj.