Slime vs Me, Myself, & I

Anissa Chan
Dec 10, 2020

Team Members: Bala Balasubramanian, Anissa Chan, Brian Cheung, Sammy Chien, Iris Ham, Simon Hoque, Kunal Jain

https://github.com/simon-th/dsl-final-project

Background

A core tenet of early behavioral psychology is B.F. Skinner’s Operant Conditioning [5]. In his experiments training rats to press a lever for food, he theorized that “The consequences of an act affect the probability of its occurring again.”

Reinforcement learning is based on a similar principle of giving consequences for actions. The goal of reinforcement learning is for an agent to learn for itself, through trial and error, the best strategy or decision-making process in an environment. In the following experiments, we used OpenAI Gym to provide environments for our agents to learn from.

OpenAI Gym

OpenAI Gym is a platform often used to create, train, and test learning agents and to develop reinforcement learning algorithms. We chose this open-source library to explore reinforcement learning because of its wide collection of standardized environments, the common interface shared by those environments, its compatibility with standard computational libraries, and the minimal assumptions it makes about the agent, which make it possible to write general algorithms and train many kinds of agents.

Gym follows an agent-environment loop where an agent acts upon an environment which then returns an observation and a reward based on that specific action. The actions, observations, and rewards can be modified to observe how the agent learns to interact with its environment. The diagram below illustrates this arrangement (Reference 1).
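As a rough sketch of this loop (using the classic Gym API from the time of writing, with CartPole-v1 as an example and a random action in place of a learned policy), the interaction looks like this:

```python
import gym

# Create the environment; CartPole-v1 is used here as an example.
env = gym.make("CartPole-v1")

observation = env.reset()
total_reward = 0
done = False

while not done:
    # The agent picks an action based on the current observation.
    # Here we simply sample a random action as a placeholder policy.
    action = env.action_space.sample()

    # The environment applies the action and returns the next observation,
    # the reward for this step, and whether the episode has ended.
    observation, reward, done, info = env.step(action)
    total_reward += reward

print("Episode finished with total reward:", total_reward)
env.close()
```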

Simple RL Algorithms

The CartPole OpenAI Gym environment is a primitive model that is easy to understand, and yet complex enough to test our understanding of both basic and sophisticated reinforcement learning algorithms.

CartPole Gym Environment

In the CartPole environment, we are tasked with keeping a pole upright as it sits on a cart along a frictionless track. The agent applies a force to the cart at each time step; the action space is discrete, with 0 representing a push to the left and 1 representing a push to the right. The state space is a vector of 4 floats representing the cart position (along the x-axis), cart velocity, pole angle, and pole angular velocity, and each component is initialized uniformly at random between -0.05 and 0.05. In this game, the agent receives a reward of 1 for each time step the game continues. An episode ends once the pole tilts more than 12 degrees from vertical or the cart reaches the edge of the screen.

Given a state, we need to assign approximate values to the expected return from each action so that we can make the best possible decision. A control function, denoted Q(state, action), represents these state-action pair values. Once we have this representation, we can derive a policy π: a set of rules specifying the best next action the agent can take in a given state.

If we adopt a random policy, i.e. π chooses an action uniformly at random at each time step, we get an average reward of 22.24 over 10 episodes. A more sensible policy would be to take the best action given a state.

The control function Q(s, a) can be represented in tabular form, such as a 2-D array, when the state and action spaces are finite and reasonably small. In our CartPole environment, however, every number along the x-axis is a possible position for the cart, and the range of angular velocities is likewise continuous, making it impossible to store a table large enough to hold all possible state combinations. To remedy this, we approximate Q for a state-action pair with a function. We used a linear function approximator (LFA), which is essentially a linear combination of the features of a state; the coefficients of this linear combination are the weights we needed to learn so that the function best approximates Q.
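A minimal sketch of such a linear approximator (our own illustrative code, not the exact implementation from the repository): the Q-values for all actions are a matrix-vector product between a weight matrix and the state features.

```python
import numpy as np

n_features = 4   # CartPole state: position, velocity, angle, angular velocity
n_actions = 2    # push left or push right

# One row of weights per action; initialized to zero (could also be small random values).
weights = np.zeros((n_actions, n_features))

def q_values(state):
    """Q(s, a) for every action a: a linear combination of the state features."""
    return weights @ state          # shape: (n_actions,)

def q_value(state, action):
    """Q(s, a) for a single state-action pair."""
    return weights[action] @ state
```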

Our weights were updated using an algorithm called Q-Learning.

We needed to update our Q function based on some true value. In traditional supervised learning, we are given the true labels of the data and calculate the loss of our predictions with respect to those labels. Since we did not have the true Q function, we needed an estimate of what we believed the Q value was for a given state and action, known as the temporal difference (TD) target. We used our estimated value of the best action taken at the next state as our “true label”. Fundamentally, we were bootstrapping from our previous estimates to come up with new estimates.
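For reference, the standard Q-learning update for a function approximator with weights w takes the following form, where α is the learning rate, γ is the discount factor, and y_t is the TD target described above; our implementation follows this general scheme.

$$
y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'), \qquad
w \leftarrow w + \alpha \left( y_t - Q(s_t, a_t) \right) \nabla_w Q(s_t, a_t)
$$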

https://github.com/kunalJa/slimevolleygym

Notice that we also had to take the norm of the gradient. We ran into exploding gradients, where the weights blew up to infinity or NaN. To eliminate this problem, we clipped our gradients by scaling them down whenever their norm exceeded 10, an arbitrary threshold that we picked.
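A minimal sketch of this kind of clipping by norm (illustrative; only the threshold of 10 comes from our setup):

```python
import numpy as np

MAX_GRAD_NORM = 10.0  # arbitrary threshold we picked

def clip_by_norm(grad, max_norm=MAX_GRAD_NORM):
    """Scale the gradient down so that its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```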

We trained our linear function approximator with a learning rate of 0.01 and a gamma (discount factor) of 0.99. The graph of the total reward gained during training can be seen below.

Evaluating this model over 10 episodes gave an average reward of 39.3, slightly better than the random policy. We simply took the final weights as our model. Another strategy might be to take the weights that produced the highest reward during training; this is not straightforward, however, because the reward gained in one episode may not translate well to an episode where the starting angle and velocity of the pole are different.

Since there is an effectively infinite number of starting states, learning to remain upright from one starting state will most probably not help with others. Our agent needed to explore as many possibilities as it could to come up with weights that generalize well. The issue boils down to a tradeoff between exploring and exploiting the current policy to maximize reward. A common way to balance this tradeoff is an ε-greedy strategy, in which the agent explores (takes a random action rather than the one recommended by the policy) with probability ε. ε decreases over time as we hone in on a policy, giving the model a better chance of converging to one that performs well from any starting state.

We trained the model with the same hyperparameters (linearly annealing ε from 1 to 0.05 over 1000 steps) for 5000 episodes.
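A minimal sketch of this ε-greedy selection with linear annealing (our own illustrative code; the schedule matches the one above, but the function names are ours):

```python
import numpy as np

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, anneal_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def epsilon_greedy_action(q_values, epsilon, n_actions):
    """Explore with probability epsilon, otherwise exploit the best action."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore: random action
    return int(np.argmax(q_values))           # exploit: greedy action
```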

To improve our model, we experimented with experience replay. The idea behind experience replay is to recall what we learned in the past: we stored the last 50 states in a deque and sampled a batch of them at each time step, which reduces the correlation between states in close proximity. In addition, we used a target Q network that freezes the weights used to compute the target Q value, updating them only periodically, so that our estimates are not always chasing a moving target. After implementing these concepts, we can see the improvement in our agent in the graph below.
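As a sketch of these two ideas (a replay buffer implemented with a deque and a periodically frozen target network), with the batch size and update frequency as hypothetical placeholders:

```python
import random
from collections import deque

# Replay buffer holding the most recent transitions (we used a small buffer of 50).
replay_buffer = deque(maxlen=50)
BATCH_SIZE = 8                 # hypothetical batch size
TARGET_UPDATE_EVERY = 100      # hypothetical target-update frequency

def store_transition(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_batch():
    """Sample a random batch to reduce correlation between nearby states."""
    batch_size = min(BATCH_SIZE, len(replay_buffer))
    return random.sample(list(replay_buffer), batch_size)

def maybe_update_target(step, weights, target_weights):
    """Periodically freeze a copy of the current weights as the target network."""
    if step % TARGET_UPDATE_EVERY == 0:
        target_weights = weights.copy()
    return target_weights
```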

Another way to approximate Q(s, a) is with a deep neural network. Deep neural nets can approximate nonlinear functions and capture more complex relationships. We built a deep Q-network (DQN) in PyTorch, updating the model with the same td_target_q as in the code above.

https://github.com/kunalJa/slimevolleygym

What we mean by this is that we use the same calculation for the TD target, but let PyTorch compute the gradients and loss itself. Here, state_dim is the dimension of the CartPole environment's state space and action_dim is the dimension of its action space, so our model takes in a state and outputs a value for each action, just like our pure NumPy implementation of the LFA. This model uses ε-greedy exploration as well.
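A minimal sketch of such a network in PyTorch (the hidden layer sizes here are illustrative; see the linked repository for the actual model):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state):
        return self.net(state)

# For CartPole: state_dim = 4, action_dim = 2.
model = DQN(state_dim=4, action_dim=2)
q_values = model(torch.randn(1, 4))   # shape: (1, 2), one Q-value per action
```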

We trained our DQN model for 1000 episodes with the same hyperparameters. From the graph of the total reward in each episode of training, we can see that the model was able to perform fantastically in some cases. It was able to reach the maximum reward of 500 many times. Unfortunately, the model seems to oscillate between 500 and 9. When evaluating the model using the final weights, we received an average score of only 32.9.

We don’t feel this is representative of what the model learned. Perhaps the model is heavily overfit to one particular type of initial state, or we are exploring more than we should be. In an effort to limit exploration, we reduced the learning rate from 0.001 to 0.0005. Unfortunately, this model oscillates in a similar way.

The agent seems to “forget” what it has learned and drops back to receiving the lowest rewards. We researched this issue and came across the term “catastrophic forgetting”. We tried different hyperparameters but were unable to get the DQN to converge on a winning strategy.

Slime Volleyball Gym Environment

The Slime Volleyball Gym Environment is an environment built on top of OpenAI Gym in which an agent learns to play the game of Slime Volleyball. The game consists of two “slimes”, each represented by a half-circle, that play volleyball in a 2D space. The action space of each agent is a set of 6 vectors of length 3, where each element corresponds to a movement command (moving forward or backward, or jumping) and each vector corresponds to a particular movement the slime can make. For this project, we used the state-based observation space: a 12-dimensional vector containing the x and y coordinates and the x and y velocities of the agent, the ball, and the opponent.

Trying the Simple Algorithms on the Slime Volleyball Environment

We implemented the LFA in the Slime Volleyball environment. We initially trained against a pre-trained model, but that model was far too good for our agent to earn any reward, so we switched to a random agent (in blue above), which allowed our agent to earn some points. The model shown above was trained for 36 hours over more than 900,000 episodes. Unfortunately, although our yellow slime agent learned to move toward the ball, it never figured out how to return the ball successfully.

This environment is much more complicated than the CartPole environment above. As such, we need more powerful methods of learning. After implementing some simple RL algorithms, we experimented with augmenting pre-trained models in the Slime environment.

Augmenting Slime Volleyball Environment

Wrappers

We conducted several experiments to explore the results of training a reinforcement learning agent in different scenarios. To do this, we augmented Slime Volleyball's default state-based environment to modify how the models observed the environment, performed their actions, and received their rewards. We implemented these augmentations as wrapper classes built on top of OpenAI Gym's environment class, which let us build on the environment's existing functionality without modifying the original source code.

For our experiments, we wrote classes that extend the ObservationWrapper, ActionWrapper, and RewardWrapper classes built into OpenAI Gym. The default `step()` function in the Environment class takes an agent's action and returns the observation and reward for that timestep. We overrode this behavior so that our wrappers take the default environment's action, observation, or reward, apply our modifications, and then return the modified values from within our overridden `step()` function. For each experiment we conducted, we wrote a separate wrapper class that augmented either the action, the observation, or the reward.
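As a sketch of this pattern, here is a hypothetical reward wrapper (the class name and the `bonus_fn` hook are ours, not the exact code we used) that adds an experiment-specific bonus on top of the environment's default reward:

```python
import gym

class BonusRewardWrapper(gym.RewardWrapper):
    """Adds an extra, experiment-specific bonus on top of the environment's reward."""
    def __init__(self, env, bonus_fn):
        super().__init__(env)
        self.bonus_fn = bonus_fn   # hypothetical function: observation -> extra reward

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Augment the default reward before returning it to the agent.
        reward = reward + self.bonus_fn(obs)
        return obs, reward, done, info
```

The ObservationWrapper and ActionWrapper variants follow the same idea, intercepting the observation returned by `step()` or the action passed into it.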

Training

After creating our augmented environments, we trained an agent to play Slime Volleyball in each environment. Our goal was to determine whether training agents in our augmented environments affected their final performance. We used the three pretrained agents from the slime volleyball repository as our base models — each base agent used a different learning algorithm:

  • Proximal Policy Optimization (PPO) with Message Passing Interface (MPI)
  • PPO self-play
  • Genetic Algorithm (GA) self-play

We trained each pretrained model for additional iterations in each augmented environment: an extra 100,000 tournament cycles for GA and an extra 2,000,000 time steps for PPO. We also created “extended” versions of each pretrained model by training them for the same number of additional iterations in the default environment. These extended models let us compare the new models against models trained for the same total number of iterations, so we can identify how much the additional iterations versus the augmented environment affected the final model's performance.

PPO MPI and Self-play

The PPO models used the Stable Baselines library's PPO implementation. PPO is a policy optimization algorithm, so we wanted to see whether the agents would learn a policy that resulted in high scores when tested in the default environment. We trained PPO models both with MPI and with self-play. Self-play refers to a learning setup in which the agent plays against a copy of itself to improve performance: as the agent gets better at the game, its opponent improves with it. Algorithms without self-play always play against an opponent whose model parameters stay constant. MPI refers to a message-passing standard that makes it easier to use parallel computing; the PPO-MPI learning algorithm is not self-play, but uses parallel workers to speed up training.
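As a rough sketch of how such a PPO model might be trained with Stable Baselines (version 2) in this environment (the policy choice and hyperparameters here are illustrative, not the exact settings from the slimevolleygym repository):

```python
import gym
import slimevolleygym  # registers SlimeVolley-v0 with gym
from stable_baselines import PPO1
from stable_baselines.common.policies import MlpPolicy

env = gym.make("SlimeVolley-v0")

# PPO1 is the MPI-compatible PPO implementation in Stable Baselines 2.
model = PPO1(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=2_000_000)   # the extra 2,000,000 time steps mentioned above
model.save("ppo1_slimevolley_augmented")
```

In our experiments, `env` would instead be one of the wrapped (augmented) environments described earlier.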

GA Self-play

The remaining model for each augmented environment was trained using a Genetic Algorithm (GA). In this setup, each agent's policy is a multi-layer perceptron (MLP). Our extended learning algorithm is as follows:

  1. Start with a single MLP with known weights.
  2. Create a population of related agents by adding some Gaussian noise to each weight parameter. In our case, our population had 128 related agents.
  3. Play a total of N games. For each game:
     • Randomly choose two agents in the population and have them play against each other.
     • If the game is tied, add a bit of noise to the second agent's parameters.
     • Otherwise, replace the loser with a clone of the winner, and add a bit of noise to the clone.

To determine which agent is currently the best, we track each agent's winning streak (i.e., how long that agent has survived). The winning streak is a proxy for an agent's performance within its population, and using it saves a lot of time when picking the best agent out of the 128-agent population.
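A rough sketch of this tournament loop (illustrative only: `play_game`, `mutate`, and the population representation are our own simplified stand-ins for the actual GA code):

```python
import copy
import random
import numpy as np

def mutate(agent, sigma=0.1):
    """Return a copy of the agent with Gaussian noise added to every weight."""
    child = copy.deepcopy(agent)
    for w in child.weights:              # assumes agent.weights is a list of numpy arrays
        w += np.random.normal(0.0, sigma, size=w.shape)
    return child

def ga_selfplay(population, n_games, play_game):
    """Tournament-style GA self-play; play_game returns +1, -1, or 0 for a tie."""
    streaks = [0] * len(population)      # winning streaks, used as a proxy for skill
    for _ in range(n_games):
        i, j = random.sample(range(len(population)), 2)
        result = play_game(population[i], population[j])
        if result == 0:                                  # tie: perturb the second agent
            population[j] = mutate(population[j])
        else:                                            # replace loser with a noisy clone of the winner
            winner, loser = (i, j) if result > 0 else (j, i)
            population[loser] = mutate(population[winner])
            streaks[winner] += 1
            streaks[loser] = 0
    best = int(np.argmax(streaks))                       # longest-surviving agent wins
    return population[best]
```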

Evaluating the Environments

After completing the additional training of all three pretrained models in the augmented environments, we evaluated the new agents against five different opponents: the baseline agent (an RNN), a PPO agent pretrained against the baseline, a PPO agent pretrained with self-play, a GA agent pretrained with self-play, and a random agent that chooses actions randomly. We measure each model's performance by the mean and standard deviation of the cumulative score (the agent's score minus the opponent's score) over 500 trials; a positive score means the agent performed better than its opponent. We discuss our experiments and their results in the sections below. In the tables you will come across, the rows represent the trained models and the columns represent the opponents.

Experiments and Results

Modifying Rewards

Spike Experiment

We had two experiments that modified the reward values with RewardWrapper: Spike and Arc.

The Spike experiment added a constant reward whenever the agent performed a “spike”: jumping up and hitting the ball so that it moved with high velocity toward the opposing side, ideally horizontally or downward. A spike in this context is analogous to a spike in volleyball, in which a player hits the ball hard and fast so that the opponent has little chance of returning it. The goal of this experiment was to incentivize the agent to perform these spikes, which we hoped would improve its overall performance.

Some of the results we obtained from this experiment are shown in the tables below. The rows represent the trained models and the columns represent the opponents.

Compared to the control agents, which were trained for the same additional number of timesteps and tournament cycles but without the spike RewardWrapper, only the PPO self-play Spike agent performs noticeably better than the PPO self-play extended agent when playing against the base PPO agent (it scores 4.71 points against its opponent on average). We surmise that this improvement stems from two factors: first, the PPO self-play Spike agent plays more aggressively, since it is trying to hit more spikes; second, the base PPO agent is simply worse at playing defensively and cannot handle the aggression. An example of this behavior is shown in the gif below. The yellow agent is the PPO self-play Spike agent, while the blue agent is the base PPO agent. The yellow agent hits a spike toward the blue agent, who cannot adapt in time to recover the ball.

At the same time, this PPO self-play Spike agent performs much worse against the base GA self-play agent, scoring -3.075 points on average. It is not obvious why this is the case, but one explanation is that the spike-trained agent is aggressive to the point of neglecting careful defensive play. The base GA self-play agent is already one of the best-performing agents, with good survivability and few mistakes. Since aggressive agents like the spike-trained PPO agent make mistakes more often due to the riskiness of committing to a “spike”, a good defensive agent will simply lose less and perform better on average. An example of this is shown below. The blue agent is GA, while the yellow agent is the PPO self-play Spike agent. In the two matches, yellow makes a mistake and loses a point because it tries a bit too hard to make the ball travel low and horizontally, like a spike. The blue agent, on the other hand, isn't trying to spike all the time and focuses more on surviving. In this light, it makes sense that the PPO self-play Spike agent would perform worse than other agents trained for a similar length of time.

Unfortunately, all of the other comparisons between the agents trained to spike and the agents trained in the original environment were inconclusive. The remaining results are shown in the tables in the appendix.

Arc Experiment

The Arc experiment added a varying amount of reward whenever the agent hit the ball in an arc, which we defined as any shot hit by the agent at an angle between 30 and 60 degrees toward its opponent. We gave the agent the maximum reward when it hit the ball at a 45-degree angle at high velocity; at other angles, we reduced the reward depending on how far the angle was from 45 degrees, and if the agent hit the ball at a low velocity (i.e., below a given threshold), we halved the reward. The goal of this experiment was to incentivize the agent to hit the ball at high velocities and at angles that would make it travel farther. Like the Spike experiment, we hoped this would improve the agent's overall performance.
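A sketch of how such an arc bonus might be computed from the ball's velocity right after the agent hits it (our own illustrative version; the maximum bonus and speed threshold are hypothetical, and the exact scaling in our wrapper differs slightly):

```python
import math

MAX_BONUS = 1.0          # hypothetical maximum extra reward
SPEED_THRESHOLD = 1.0    # hypothetical "low velocity" cutoff

def arc_bonus(ball_vx, ball_vy):
    """Extra reward for hitting the ball between 30 and 60 degrees, peaking at 45."""
    angle = math.degrees(math.atan2(ball_vy, abs(ball_vx)))   # angle above the horizontal
    if not 30.0 <= angle <= 60.0:
        return 0.0
    # Scale the bonus down linearly the farther the angle is from 45 degrees.
    bonus = MAX_BONUS * (1.0 - abs(angle - 45.0) / 15.0)
    speed = math.hypot(ball_vx, ball_vy)
    if speed < SPEED_THRESHOLD:
        bonus *= 0.5                                          # halve the bonus for weak hits
    return bonus
```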

Some of the results we obtained from this experiment are shown in the table below. The rows represent the trained models and the columns represent the opponents.

Like the Spike experiment, the PPO self-play agent in the Arc experiment also performed noticeably better against the pretrained PPO agent than the control agent did. The aim of this experiment was to train the agent to hit the ball as far as possible, making it difficult for the opponent to return it, and we think this policy does well against the pretrained PPO agent, whose own policy cannot handle far-reaching shots very well. As with the Spike PPO self-play agent, however, the Arc agent performs relatively worse against the pretrained GA self-play agent than the control agent does, likely because that model's policy handles far-reaching shots well.

The behavior the Arc policy exhibits is shown in the gif below, where the yellow agent is the PPO self-play Arc agent. The yellow agent consistently tries to hit the ball at a 45-degree angle so that it lands farther away from the blue agent. In the example below, the blue agent has to move back quite a bit in order to reach the ball in time. It is possible that some opponents' policies do not allow them to do so, and hence they perform worse against an agent trained to hit the ball that far.

The remaining comparisons between the Arc agent and the control agent are inconclusive. For the most part, the agents behave very similarly and achieve similar results against the same opponent. It is possible that the pretrained agents had already learned to make arcing shots on their own, as a way of clearing the net and landing the ball on their opponent's side, so rewarding that behavior in the environment did not yield much better results. The mean scores against all opponents achieved by the extended (control) agent and the Arc agent are plotted against each other in the graph below to show this.

Adding Noise

In the next experiments, we added noise to the values from the `step()` function that the agent used in training. The goal was to observe whether additional noise prevents overfitting or simply creates unpredictable models. We focused on adding noise to the observations and actions by extending the ObservationWrapper and ActionWrapper classes, and we explored the two wrappers independently.

Training on Noisy Actions

We explored three different ways of adding noise and making modifications to the agent's actions through the ActionWrapper. The goal of these experiments was to observe how these modifications affected the training of the agent, and to see whether any of them produced an unpredictable model that actually performed better than the baseline model.

The actions in SlimeVolleyGym are 3-bit binary vectors, as shown below. The added noise could be either positive or negative: if a bit's value exceeded 0.75 after noise was added, the bit became 1; otherwise, it became 0.
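A minimal sketch of this bit-level noise (illustrative; the threshold follows the rule above, and the probability and noise parameters match the _bit variant described below):

```python
import numpy as np

NOISE_PROB = 0.1    # probability of perturbing one bit of the action
THRESHOLD = 0.75    # a noisy bit above this value is treated as 1

def add_bit_noise(action):
    """Add Gaussian noise (mean 0, std 1) to one random bit and re-binarize it."""
    action = np.array(action, dtype=float)
    if np.random.rand() < NOISE_PROB:
        i = np.random.randint(len(action))
        action[i] += np.random.normal(0.0, 1.0)
        action[i] = 1.0 if action[i] > THRESHOLD else 0.0
    return action.astype(int)
```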

The first experiment (_action) replaced the agent's predicted action with a random action with probability 0.07. The second experiment (_bit) added Gaussian noise (mean 0, standard deviation 1) to one randomly chosen bit of the action with probability 0.1. After training and evaluating both variants, we found that both models became twitchy. With both wrappers, the trained PPO self-play agent performed better against the pretrained PPO agent and slightly better against the baseline model when tested in a normal environment. This is shown in the table below (for the full chart of results, refer to the appendix). The improvement might be due to the confusion the normal PPO agent experiences when it observes its erratically behaving opponent.

Another conclusion was that training GA self-play with noise added to one of the action bits led to significantly worse performance against all models.

As shown below, PPO against PPO behaves more smoothly than PPO against a PPO self-play agent trained with noise added to one of the action bits.

Click on the links in the caption to see a better video of the models playing against each other.

PPO against PPO

https://drive.google.com/file/d/1EbXXUEza02Otyow1dYMctxQrPH7EkC8_/view?usp=sharing

PPO against PPO self-play bit noise

https://drive.google.com/file/d/1DyfsAExxGMeLxR-QqE0U6Ye0NQu19y7p/view?usp=sharing

The third experiment added noise with a mean of 0 and a standard deviation of 0.1 to all bits of every action. We expected the model to increase the magnitude of its outputs to compensate for the additional noise. Instead, we found that every model trained with this wrapper performed significantly worse against the baseline model than any of the normal pretrained models did.

Training on Noisy Observations

The observations in SlimeVolleyGym specify the horizontal and vertical positions and velocities of the agent, ball, and opponent. They are represented as a 12-dimensional vector in which the first four elements describe the agent, the next four describe the ball, and the last four describe the opponent. We determined that the most natural way to add noise to the observations was to add it to all 12 elements. We ran four experiments exploring the effect of observation noise, varying the standard deviation of the noise. In the first three, the noise had a mean of 0 and was added with probability 0.2: the first experiment (obs_one) used a standard deviation of 1, the second (obs_half) a standard deviation of 0.5, and the third (obs_tenth) a standard deviation of 0.1. The final experiment (obs_noise) added noise with standard deviation 0.1 on every step.
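A sketch of this observation noise using the ObservationWrapper pattern (illustrative; the default `std` and probability shown here are those of the obs_tenth variant):

```python
import gym
import numpy as np

class NoisyObservationWrapper(gym.ObservationWrapper):
    """Adds Gaussian noise to every element of the 12-dimensional observation."""
    def __init__(self, env, std=0.1, prob=0.2):
        super().__init__(env)
        self.std = std      # standard deviation of the noise
        self.prob = prob    # probability of perturbing a given observation

    def observation(self, obs):
        if np.random.rand() < self.prob:
            obs = obs + np.random.normal(0.0, self.std, size=obs.shape)
        return obs
```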

We tested various standard deviations because of the scale of the observations, which range approximately over [-2, 2]. We wanted to determine how the intensity of the noise would affect how the agent plays and whether it would help create an overall better model.

Results

As seen in the table below, the PPO self-play model trained in the noisy environment using the ObservationWrapper with a standard deviation of 0.5 performed better against the Baseline and PPO models when evaluated in the normal environment. However, it did significantly worse against the GA self-play and Random Policy models.

In the table below, you can see that GA self-play performs significantly worse when it is trained in any noisy environment and then evaluated against a model in the normal environment.

The GIF below shows an opponent trained on the baseline model in a normal environment, while on the left is a GA self-play model trained in a noisy environment with noise of magnitude 0.1.

Baseline vs GA

For the rest of the models, we concluded that adding noise during training generally made them perform worse when evaluated in a normal environment.

Pretrained Models Evaluated in Noisy Environments

Previously, we experimented with adding noise to the actions or observations during training to see whether it improved a model's performance against different strategies, and we noticed that it generally resulted in a worse model. Next, we wanted to see which of the pretrained models were the most robust by observing their performance in noisy environments: a noisy action environment, a noisy observation environment, and one with both. We defined a robust model as one that performed relatively well in a normal environment and continued to perform decently well, without a significant drop in performance, in the noisy environments. An interesting conclusion was that although training with noise significantly hurt GA self-play, the pretrained GA self-play model turned out to be the most robust model overall when evaluated in noisy environments. In general, we also found that evaluating in an environment with noisy observations and/or actions increased the standard deviations for all five models.

In the noisy action environment, noise with a mean of 0 and a standard deviation of 0.1 was added to all bits of the action vector. The results of the pretrained models in the noisy action environment are shown in the table below; the white rows represent evaluations done in a normal environment, while the blue rows represent evaluations done in the noisy action environment. We observed that GA self-play seemed to be the most robust of the models, as it became the best performer relative to the others by cumulative score.

In the noisy observation environment, noise with a mean of 0 and a standard deviation of 0.1 was added to all elements of the observation vector. The results of the pretrained models in the noisy observation environment are shown in the table below; the white rows represent evaluations done in a normal environment, while the blue rows represent evaluations done in the noisy observation environment. Here, PPO seemed to be the most robust model, as it increased its cumulative score against some opponents and preserved its decent performance against the baseline.

Finally, the pretrained models were evaluated in an environment with both noisy actions and noisy observations. The results are shown in the table below; the white rows represent evaluations done in a normal environment, while the blue rows represent evaluations done in this doubly noisy environment. Green indicates scores that increased in the noisy environment, while red indicates scores that decreased. Once again, GA self-play seemed to be the most robust model, as it achieved the greatest cumulative score.

Conclusion

We hope you enjoyed our blog post! While we were not able to create a Q-learning model that plays Slime Volleyball, we hope our experiments on the CartPole environment provide a helpful introduction to implementing and experimenting with reinforcement learning algorithms. Furthermore, our experiments with augmenting the Slime Volleyball environment taught us more about the functionality of OpenAI Gym environments. We explored different ways of modifying the training process of learning agents and showed that we can nudge the models toward exhibiting certain behaviors. Happy coding :)

Appendix

In all the tables below, we look at the scores from the perspective of the right agent, which is represented by the rows. This includes the agents we trained on the augmented and noisy environments. The columns represent the opponents of these agents.

Modified Rewards Results

Noisy Action Results

Noisy Observation Results

References

  1. OpenAI Gym Docs
  2. Gym Environment Wrappers
  3. https://arxiv.org/pdf/1707.06347.pdf
  4. https://rubenfiszel.github.io/posts/rl4j/2016-08-24-Reinforcement-Learning-and-DQN.html
  5. https://courses.lumenlearning.com/psychology2x4master/chapter/operant-conditioning/
  6. https://github.com/hardmaru/slimevolleygym
