Sunday, April 7, 2019

Dave Computes a Simple Machine Learning Algorithm

Here's a relatively straightforward way to develop a bot that uses machine learning to discover optimal game play. The approach is appropriate for any turn-based game (with 1 or more players) that can be reduced to a manageable number of game states and possible choices. So far, I have used this algorithm with some success for multi-player Five-Card Draw poker and for the game of Coup.


Idea

Play a game billions of times. As you play, keep track of every game state reached, every choice made in those states, how often each choice was made, and how often each choice resulted in a win. When learning, use the data accumulated to make choices that maximize the chance of winning. When done learning, the recommended strategy consists of the choices that were made most often.


Details

1.  Identify every game state. A game state includes everything currently known to the player. In poker, this could include the player's hand and opponents' betting, but it cannot include opponents' unseen hands or the arrangement of cards in the deck. For most interesting games, a collection of game states would be too large to store in memory, or would be so large that finding an optimal strategy would be too computationally intensive. Therefore, much work goes into identifying which aspects of a game state to consider and which to ignore. For example, we might include a poker player's hand category (high card, pair, two pair, etc) but not their particular cards. Likewise, we might include the amount of money on a poker table, but not the details of who contributed that money and when.
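
To make this concrete, here is a minimal Python sketch of one way such an abstracted poker state might look. The field names and the $5 pot bucket are illustrative choices of mine, not part of the algorithm.

from collections import namedtuple

# Keep only coarse features of a poker situation; discard the exact cards and
# the betting sequence that produced the pot.
AbstractState = namedtuple("AbstractState", ["hand_category", "pot_bucket"])

def abstract_state(hand_category, pot):
    # Bucket the pot into $5 increments so similar pots share a game state.
    return AbstractState(hand_category, pot // 5)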

2.  Identify all possible choices that a player might make from any particular game state. Again, some games may need to be simplified to keep this manageable (such as limiting how much a poker player can raise).

3.  For each game state and choice, store the number of times that choice was made and the accumulated gains resulting from that choice. For example, I might have encountered a particular state S on 100 occasions. On 30 of those times, I chose option C. In 20 out of those 30 cases, I went on to win the game. I would therefore record that state S and choice C resulted in 20 wins in 30 games. In the case of poker, instead of recording the number of games won, we would record the total amount of money gained as a result of a choice. For example, I might record that state S and choice C resulted in earning -12 dollars after 40 hands.

Let's call all this data our result table. It maps any game state and choice to the resulting number of wins and games.
<game state, choice> → <wins, games>
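
In code, the result table can be as simple as a dictionary keyed by (game state, choice) pairs. A minimal Python sketch, with names of my own choosing:

from collections import defaultdict

# Maps (game_state, choice) -> [gain, games], where gain is the number of wins
# (or, for poker, the total dollars won) and games is how many times that
# choice was made in that state.
result_table = defaultdict(lambda: [0.0, 0])

def record_result(table, state, choice, gain):
    # gain is 1 or 0 for a win/loss game, or dollars won/lost for poker.
    entry = table[(state, choice)]
    entry[0] += gain
    entry[1] += 1
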
4.  For any particular game state, the expected value of a particular choice is the number of wins (or money gained) divided by the number of times that choice was made. This value is easily computed from the result table. For example, if state S and choice C resulted in 20 wins in 30 games, then the expected value for choice C in state S is 20 / 30 ≈ 0.667. Likewise, if state S and choice C resulted in earning -12 dollars after 40 poker hands, then the expected value is -12 / 40 = -0.3.

5.  In the course of learning, whenever a player must make a choice, use the result table to compute the expected value for each possible choice, and choose the option with the greatest expected value. For example, suppose I find myself in some game state, and I must choose from among options A, B, and C. I determine the expected value of A is 0.5, the expected value of B is 0.9, and the expected value of C is 0.1. I will therefore choose option B to maximize my expected value.
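
Continuing the result-table sketch above, computing expected values and choosing greedily might look like the following. Treating an untried choice as infinitely promising (so that it gets tried at least once) is a convention of mine, not something specified here.

def expected_value(table, state, choice):
    gain, games = table[(state, choice)]
    if games == 0:
        return float("inf")  # assumption: an untried choice gets tried first
    return gain / games

def best_choice(table, state, choices):
    # Pick the available choice with the greatest expected value so far.
    return max(choices, key=lambda c: expected_value(table, state, c))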

6.  During the course of each game played, record each game state reached and each choice made. For example, I might record that, when I found myself in game state S1, I made choice A, and when I then found myself in game state S2, I made choice B.
<game state, choice>, <game state, choice>, ..., <game state, choice>
If this is a game like poker, I must record the amount of money I had at the time a choice was made, so that I can later determine how much I gained or lost as a result of that choice.

Note that this list of choices made is separate from the result table. In fact, each bot using this learning algorithm must keep its own list of choices made, even though they might all share a single result table.

7.  After each game (or poker hand), when the outcome is known (who won the game, or how much each player earned), go through all of the choices made by a player and record the results of those choices in the result table. For example, suppose I reached game state S and made choice C at some point during a game. In previous games, I had made choice C in game state S on 9 occasions, and I went on to win the game on 6 of those occasions. Suppose I now win this particular game. I must increment both the number of games and the number of wins for state S and choice C, so they now show I went on to win the game on 7 of 10 occasions. On the other hand, if I had lost, then I would not have incremented the number of wins, and my result table would record that I had won the game on 6 of 10 occasions. Once I've recorded the outcomes of all choices made in the game, then I can safely discard the list of choices made and begin the next game with an empty list.
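
Building on the same sketch, the per-bot list of choices from the previous step and the post-game update described here might look like this. The function names and the stack-based bookkeeping for poker are mine.

history = []  # this bot's choices in the current game: (state, choice, stack)

def note_choice(state, choice, stack):
    history.append((state, choice, stack))

def settle_poker_hand(table, final_stack):
    # Credit every recorded choice with the money it ultimately led to.
    # (For a pure win/loss game, the gain would instead be 1 or 0 per entry.)
    for state, choice, stack_then in history:
        record_result(table, state, choice, final_stack - stack_then)
    history.clear()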

8.  In the course of learning, a player must sometimes make a choice with a lower expected value. The trouble with always making the choice with the greatest expected value is that, before the result table closes in on an optimal strategy, it will often steer the player toward a non-optimal choice. For example, suppose that in a certain game state, choice A's theoretical expected value is 0.8, while choice B's is 0.5. Choice A is therefore the optimal choice in this game state. In the long run, we hope the data in our result table will converge to these expected values. But early in our learning, the expected values in our result table may be very different. Let's consider two scenarios.

In the first scenario, a moment occurs when choice A's expected value is 0.8 and choice B's is 0.9. Because B currently has the greater expected value, our player will keep choosing B. Eventually, we expect that option B will perform poorly enough that its expected value drops below A's (0.8), and the player will then begin to make the optimal choice. In this scenario, the algorithm has self-corrected.

In the second scenario, a moment occurs when choice A's expected value is 0.3 and choice B's is 0.9. Again, because B currently has the greater expected value, our player will keep choosing B. We expect that B's expected value will approach the theoretical value of 0.5. But in doing so, it may never drop below A's current expected value of 0.3, and therefore the player may never get the chance to try out option A again and discover that it will outperform B in the long run. Our algorithm has gotten stuck.

For this reason, it is necessary for the player to sometimes make a choice with a lower expected value. Some small fraction of the time, the player should make a random choice instead of the one with the greatest expected value; in practice, a probability of 1% works well. The higher this probability, the less likely the algorithm is to get stuck, so it converges more quickly, but to less accurate values. A lower probability of random play gives more accurate values, but is more likely to get stuck without finding an optimal strategy.
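
In the sketch developed above, this amounts to wrapping the greedy rule in an occasional random pick, with the 1% exploration rate as a tunable constant:

import random

EXPLORE = 0.01  # probability of making a random, exploratory choice

def learning_choice(table, state, choices):
    if random.random() < EXPLORE:
        return random.choice(choices)           # occasionally explore
    return best_choice(table, state, choices)   # otherwise exploit the table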

9.  Ultimately, the learning algorithm will most likely require billions of games before the result table converges to meaningful results. (Note that 32-bit integers are therefore unlikely to be sufficient for storing numbers of wins and games.)

10.  At the conclusion of the learning process, focus on the result table's numbers of games--not expected values. For example, suppose that, for some particular game state, our result table shows the following data:
choice A:  2 billion wins in 5 billion games (expected value = 0.40)
choice B:  5 million wins in 50 million games (expected value = 0.10)
choice C:  3.9 billion wins in 10 billion games (expected value = 0.39)
It would be incorrect to conclude that choice A is optimal just because it has the greatest expected value. Notice that the computer chose option C twice as often as it chose A, and that their expected values (0.40 and 0.39) are very close. Notice also that the computer chose option B far less often (in only 50 million games), and that B's expected value is much lower (0.10). What probably happened here is that the computer found that option B performed poorly compared with A and C. After a while, it stopped choosing option B (except on those 1% of occasions when it made a choice at random), resulting in a much lower number of games.

On the other hand, option C was used twice as often as A, and both resulted in similar expected values. Most likely, whenever the computer chose C a couple of times in a row, it learned to defend against that choice. This temporarily made A's expected value greater than C's, and so the computer chose A next. But then the computer adjusted to defend against A, making C the better choice again.

So, which choice does our result table recommend? The answer comes from the numbers of games. The computer reached this game state in roughly 15 billion games. It chose A in 5 billion of those games and C in 10 billion. Therefore, the optimal strategy suggested by the result table is for us to choose A with probability 1/3, choose C with probability 2/3, and never choose B. In other words, a mixed strategy is required. A pure strategy (in which the computer always makes the same choice) could be exploited.
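
One way to read that mixed strategy out of a finished result table is to normalize the game counts, discarding any option chosen only about as often as random exploration alone would explain. The 5% cutoff in this sketch is my own assumption.

def recommended_strategy(table, state, choices, cutoff=0.05):
    games = {c: table[(state, c)][1] for c in choices}
    total = sum(games.values())
    # Drop options chosen so rarely that exploration alone could explain them.
    kept = {c: n for c, n in games.items() if total and n / total >= cutoff}
    kept_total = sum(kept.values())
    return {c: n / kept_total for c, n in kept.items()}

Applied to the counts above, this drops B and keeps A and C in roughly a 1-to-2 ratio, giving probabilities of about 1/3 and 2/3.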

11.  A game may have many optimal strategies (each of which may itself be a mixed strategy). Suppose we run our software once, and it recommends one strategy, and then we run it again only to find it recommends a different strategy. This might mean that we didn't run the software long enough for it to arrive at an optimal strategy. But it might simply mean that the software converged to two different and equally optimal strategies.


Example

Let's see how this algorithm is used to solve a toy version of poker. In this 2-player game, each player pays a $1 ante. Each player is dealt an ace (highest), king, or queen (each equally likely). Note that both players may be dealt the same card value. Players take turns acting by folding, calling/checking, or betting/raising in increments of $1, with a maximum of 2 raises. A couple of examples illustrate how the game is played.

Example #1. After antes, there is $2 in the pot. Player 1 is dealt an ace and player 2 is dealt a king. Player 1 checks. Player 2 bets. There is now $3 in the pot. Player 1 raises. There is now $5 in the pot. Player 2 folds. Player 1 keeps all $5. Player 1 has gained $2 and player 2 has lost $2.

Example #2. After antes, there is $2 in the pot. Both players are dealt a king. Player 1 bets. There is now $3 in the pot. Player 2 calls. There is now $4 in the pot. Both players show their cards. Because they hold the same cards, they split the pot, each collecting $2. Each player has gained $0.

Let's identify the game states. A player may hold one of three possible cards:  A, K, or Q. They may be asked to act after any of 6 possible betting histories:

History  Meaning
(empty)  no actions have taken place yet
c        player 1 checked
cr       player 1 checked, player 2 bet
crr      player 1 checked, player 2 bet, player 1 raised
r        player 1 bet
rr       player 1 bet, player 2 raised

With 3 cards and 6 histories, there are a total of 3 × 6 = 18 game states.

When it's their turn, a player may make one of three choices:  fold (f), check/call (c), or bet/raise (r). Not all choices are available in all game states. A player can only fold in response to an opponent's raise, and a player cannot raise if two raises have already occurred.
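
These rules are easy to capture in code. A small helper for the toy game (the function name and representation are mine) might look like this:

def legal_choices(history):
    # history is the betting string from the table above, e.g. "", "c", "cr".
    if history.endswith("r"):            # facing a bet or raise
        if history.count("r") >= 2:      # two raises already, so no more raising
            return ["f", "c"]
        return ["f", "c", "r"]
    return ["c", "r"]                    # nothing to call: check or bet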

Initially, our result table stores a 0 for the number of games and amount won in response to each choice. After thousands of games it might look something like this:

Game State  Choice  Games   Won
...
K           c        2000  1600
K           r        1000   900
...
Krr         f         300     0
Krr         c          10    -3
...

Suppose I have $10 in my stack. I pay a $1 ante, leaving me with $9 in my stack. I'm dealt a king, and I'm the first player to act. I find myself in game state K (representing a king and an empty betting history). I can check or bet. The result table above shows the expected value for betting (900/1000 = 0.9) is higher than the value for checking (1600/2000 = 0.8). I therefore bet. I record the game state (K), my choice to bet (r), and the amount in my stack at the time (9).

<K, r, 9>

Now suppose my opponent raises. I now find myself in game state Krr (representing a king, a bet, and then a raise). I now have $8 (having just bet $1). I can fold or call. Normally, I'd consult the result table, which tells me that folding (0/300 = 0.0) is preferable to calling (-3/10 = -0.3). However, we'll suppose this is one of the 1% of occasions on which I make a random choice instead of following the table, and I happen to choose to call. I record the game state (Krr), my choice to call (c), and the amount in my stack at the time (8).

<K, r, 9>, <Krr, c, 8>

I now have $7 (having just called $1). Suppose my opponent reveals a queen. I win the $6 pot, which means I now have $13 in my stack.

Now I go back and update the result table. When I bet in state K, I had $9. As a result of that choice, I wound up with $13. Betting earned me $13 - $9 = $4. Previously, that choice had earned me $900 in 1000 games. Now, it's earned me $900 + $4 = $904 in 1001 games. Likewise, when I called in state Krr, I had $8. I wound up with $13, so calling earned me $13 - $8 = $5. Previously, that choice had earned me $-3 in 10 games. Now, it's earned me $-3 + $5 = $2 in 11 games. I therefore update the result table as follows.

Game State  Choice  Games   Won
...
K           c        2000  1600
K           r        1001   904
...
Krr         f         300     0
Krr         c          11     2
...

It took about 3 hours for my learning software to run 10 billion games. Here are the results, presented in a more compact format. Each entry shows the amount won and the number of games played for a given choice; dividing the first number by the second gives that choice's expected value. (M signifies millions and B billions.)

Game State  Fold    Check/Call  Bet/Raise
Q                   699M/3.2B   29M/164M
K                   2.5B/3.3B   44M/61M
A                   5.6B/3.1B   361M/206M
Qcr         0/1.4B  -2.4M/4.7M  -3.4M/4.7M
Kcr         0/546M  1.2k/937M   -3.1M/5.0M
Acr         0/4.8M  7.5M/4.8M   2.2B/1.4B
Qc                  705M/2.0B   394M/1.1B
Kc                  3.2B/3.2B   15M/18M
Ac                  26M/15M     6.0B/3.2B

Qcrr        0/394M  -2.0M/2.0M
Acrr        0/5.2M  2.1B/1.0B

Using the strategy suggested here, game states like Ar and Kcrr are never reached and have been omitted from the results above. (A careful study shows that player 1 never begins with a bet, and player 2 never bets a king.)

For each game state above, the relative numbers of games tell us which choices to make, and in what proportions, for correct play. The suggested strategy is summarized in the following table.

Player  Card   Play
1       queen  check+fold
1       king   check+call 2/3, check+fold 1/3
1       ace    check+raise
2       queen  check 2/3, bet 1/3
2       king   check
2       ace    bet

It shows that player 1 should always begin by checking. If, for example, player 2 responds with a bet and player 1 has a king, then player 1 should call with probability 2/3 and fold with probability 1/3.

In the event that player 1 deviates from correct play by leading off with a bet, then player 2 should fold a queen, fold (1/2) or call (1/2) with a king, and raise an ace. If raised, player 1 should fold a queen but call with a king or ace.


Background

My earlier exploration of Counterfactual Regret Minimization (CFR) led me to the much simpler idea presented here. Another advantage of my approach is that a bot can train without modifying the game software, whereas CFR requires a special implementation of the game during training. I suspect that the results of my approach are equivalent to CFR's. It would be worth exploring ways to speed up learning, so that my algorithm approaches an optimal strategy in fewer games.
