Robotics with Python: Q-Learning vs Actor-Critic vs Evolutionary Algorithms

There are four types of Machine Learning: 

  • Supervised — when all the observations in the dataset are labeled with a target variable, and you can perform regression/classification to learn how to predict them.
  • Unsupervised — when there is no target variable, so you can perform clustering to segment and group the data.
  • Semi-Supervised — when the target variable is not complete, so the model has to learn how to predict unlabeled data as well. In this case, a mix of supervised and unsupervised models is used.
  • Reinforcement — when there is a reward instead of a target variable and you don’t know what the best solution is, so it’s more of a process of trial and error to reach a specific goal.

More precisely, Reinforcement Learning studies how an AI takes action in an interactive environment in order to maximize the reward. During supervised training, you already know the correct answer (the target variable), and you fit a model to replicate it. In an RL problem, on the contrary, you don’t know a priori what the correct answer is; the only way to find out is by taking actions and getting feedback (the reward), so the model learns by exploring and making mistakes.

RL is widely used for training robots. A good example is the autonomous vacuum: when it passes over a dusty part of the floor, it receives a reward (+1), but it gets punished (-1) when it bumps into the wall. So the robot learns which actions are right and which to avoid.

In this article, I’m going to show how to build custom 3D environments for training a robot using different Reinforcement Learning algorithms. I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example.

Setup

While a supervised use case requires a target variable and a training set, an RL problem needs:

  • Environment — the surroundings of the agent; it assigns rewards for actions and provides the new state that results from the decision made. Basically, it’s the space the AI can interact with (in the autonomous vacuum example, it would be the room to clean).
  • Action — the set of actions the AI can perform in the environment. The action space can be “discrete” (a fixed number of moves, like in chess) or “continuous” (infinitely many possible actions, like driving a car or trading).
  • Reward — the consequence of the action (+1/-1).
  • Agent — the AI learning the best course of action in the environment to maximize the reward.

Regarding the environment, the most used 3D physics simulators are PyBullet (beginners), Webots (intermediate), MuJoCo (advanced), and Gazebo (professionals). You can use any of them as standalone software or through Gym, a library originally made by OpenAI for developing Reinforcement Learning algorithms, built on top of different physics engines.

I will use Gym (pip install gymnasium) to load one of the default environments made with MuJoCo (Multi-Joint dynamics with Contact, pip install mujoco).

import gymnasium as gym

env = gym.make("Ant-v4")
obs, info = env.reset()

print(f"--- INFO: {len(info)} ---")
print(info, "\n")

print(f"--- OBS: {obs.shape} ---")
print(obs, "\n")

print(f"--- ACTIONS: {env.action_space} ---")
print(env.action_space.sample(), "\n")

print(f"--- REWARD ---")
obs, reward, terminated, truncated, info = env.step( env.action_space.sample() )
print(reward, "\n")

The robot Ant is a 3D quadruped agent consisting of a torso and four legs attached to it. Each leg has two body parts, so in total it has 8 joints (flexible body parts) and 9 links (solid body parts). The goal of this environment is to apply force (push/pull) and torque (twist/turn) to move the robot in a certain direction.

Let’s try the environment by running one single episode with the robot doing random actions (an episode is a complete run of the agent interacting with the environment, from start to termination).

import time

env = gym.make("Ant-v4", render_mode="human")
obs, info = env.reset()

reset = False #set to True to reset the environment when the episode ends
episode = 1
total_reward, step = 0, 0

for _ in range(240):
    ## action
    step += 1
    action = env.action_space.sample() #random action
    obs, reward, terminated, truncated, info = env.step(action)
    ## reward
    total_reward += reward
    ## render
    env.render() #render the current physics step
    time.sleep(1/240) #slow down to roughly real time (240 steps × 1/240 second sleep ≈ 1 second)
    if (step == 1) or (step % 100 == 0): #print first step and every 100 steps
        print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
    ## reset
    if reset:
        if terminated or truncated: #print the last step
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
            obs, info = env.reset()
            episode += 1
            total_reward, step = 0, 0
            print("------------------------------------------")

env.close()

Custom Environment

Usually, environments have the same properties:

  1. Reset — to restart to an initial state or to a random point within the data.
  2. Render — to visualize what’s happening.
  3. Step — to execute the action chosen by the agent and change state.
  4. Calculate Reward — to give the appropriate reward/penalties after an action.
  5. Get Info — to collect information about the game after an action.
  6. Terminated or Truncated — to decide whether the episode is finished after an action (success or failure).
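
These properties map directly onto Gymnasium’s Env interface. Below is a minimal, purely illustrative skeleton (the class, the 1D “world”, and the reward values are invented for this example and are not part of the Ant project):

import numpy as np
import gymnasium as gym

class MinimalEnv(gym.Env):
    ## illustrative skeleton only: a 1D world where the agent must reach position 10

    def __init__(self, render_mode=None):
        super().__init__()
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2) #0 = step left, 1 = step right
        self.render_mode = render_mode
        self.pos = 0.0

    def reset(self, seed=None, options=None): #1. restart to an initial state
        super().reset(seed=seed)
        self.pos = 0.0
        return np.array([self.pos], dtype=np.float32), {}

    def step(self, action): #3. execute the action chosen by the agent
        self.pos += 1.0 if action == 1 else -1.0
        reward = 1.0 if self.pos >= 10 else -0.1 #4. calculate reward/penalties
        terminated = bool(self.pos >= 10) #6. terminated (goal reached) ...
        truncated = False #... or truncated (e.g. time limit)
        info = {"position": self.pos} #5. get info
        return np.array([self.pos], dtype=np.float32), reward, terminated, truncated, info

    def render(self): #2. visualize what’s happening
        print(f"position: {self.pos}")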

Having default environments loaded in Gym is convenient, but it’s not always what you need. Sometimes you have to build a custom environment that meets your project requirements. This is the most delicate step of a Reinforcement Learning use case: the quality of the model strongly depends on how well the environment is designed.

There are several ways to make your own environment:

  • Create from scratch: you design everything (e.g. the physics, the body, the surroundings). You have total control, but it’s the most complicated way since you start with an empty world.
  • Modify the existing XML file: every simulated agent is defined by an XML file. You can edit the physical properties (e.g. make the robot taller or heavier), but the logic stays the same.
  • Modify the existing Python class: keep the agent and the physics as they are, but change the rules of the game (e.g. new rewards and termination rules). One could even turn a continuous env into a discrete action space.

I’m going to customize the default Ant environment to make the robot jump, changing both the physical properties in the XML file and the reward function of the Python class. Basically, I just need to give the robot stronger legs and a reward for jumping.

First of all, let’s locate the XML file, make a copy, and edit it.

import os

print(os.path.join(os.path.dirname(gym.__file__), "envs/mujoco/assets/ant.xml"))

Since my objective is to have a more “jumpy” Ant, I can reduce the density of the body to make it lighter, and add force to the legs so it can jump higher (the gravity in the simulator stays the same).
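
As a sketch of this kind of edit, the file can also be modified programmatically. This assumes the stock ant.xml exposes a density attribute on its geoms and a gear (torque multiplier) attribute on its motors, and the values 2.5 and 300 are just illustrative choices:

import os
import xml.etree.ElementTree as ET
import gymnasium as gym

## load the original XML shipped with Gym
src = os.path.join(os.path.dirname(gym.__file__), "envs/mujoco/assets/ant.xml")
tree = ET.parse(src)
root = tree.getroot()

## lighter body: lower the density wherever it is defined
for geom in root.iter("geom"):
    if "density" in geom.attrib:
        geom.set("density", "2.5")

## stronger legs: raise the gear of every motor
for motor in root.iter("motor"):
    if "gear" in motor.attrib:
        motor.set("gear", "300")

## save the custom copy that the new environment class will load
os.makedirs("assets", exist_ok=True)
tree.write(os.getcwd() + "/assets/custom_ant.xml")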

You can find the full edited XML file on my GitHub.

Then, I want to modify the reward function of the Gym environment. To create a custom env, you have to build a new class that overrides the original one where needed (in my case, how the reward is calculated). After the new env is registered, it can be used like any other Gym env.

from gymnasium.envs.mujoco.ant_v4 import AntEnv
from gymnasium.envs.registration import register
import numpy as np

## modify the class
class CustomAntEnv(AntEnv):
    def __init__(self, **kwargs):
        super().__init__(xml_file=os.getcwd()+"/assets/custom_ant.xml", **kwargs) #specify xml_file only if modified

    def CUSTOM_REWARD(self, action, info):
        torso_height = float(self.data.qpos[2]) #torso z-coordinate = how high it is
        reward = np.clip(a=torso_height-0.6, a_min=0, a_max=1) *10 #when the torso is high
        terminated = bool(torso_height < 0.2 ) #if torso close to the ground
        info["torso_height"] = torso_height #add info for logging
        return reward, terminated, info

    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action) #override original step()
        new_reward, new_terminated, new_info = self.CUSTOM_REWARD(action, info)
        return obs, new_reward, new_terminated, truncated, new_info #must return the same things

    def reset_model(self):
        return super().reset_model() #keeping the reset as it is

## register the new env
register(id="CustomAntEnv-v1", entry_point="__main__:CustomAntEnv")

## test
env = gym.make("CustomAntEnv-v1", render_mode="human")
obs, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()

If the 3D world and its rules are well designed, you just need a good RL model, and the robot will do anything to maximize the reward. Two families of models dominate the RL scene: Q-Learning models (best for discrete action spaces) and Actor-Critic models (best for continuous action spaces). Besides those, there are other, more experimental approaches, like Evolutionary Algorithms and Imitation Learning.

Q Learning

Q-Learning is the most basic form of Reinforcement Learning and uses Q-values (the “Q” stands for “quality”) to represent how useful an action is for gaining some future reward. To put it in simple terms, if the agent gets a certain reward at the end of the game after a sequence of actions, the Q-value of each earlier state/action pair is that future reward, discounted by how far away it is.

As the agent explores and receives feedback, it updates the Q-values stored in the Q-matrix using the Bellman equation. The goal of the agent is to learn the optimal Q-value of each state/action pair, so that it can make the best decision in every state and maximize the expected future reward.

During the learning process, the agent uses an exploration-exploitation trade-off. Initially, it explores the environment by taking random actions, allowing it to gather experience (information about the rewards associated with different actions and states). As it learns and the level of exploration decays, it starts exploiting its knowledge by selecting the actions with the highest Q-values for each state.
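
As a minimal sketch of this logic (the state/action sizes and hyperparameters below are made up, and it assumes a small discrete toy problem rather than the Ant):

import numpy as np

n_states, n_actions = 10, 4 #toy sizes, not the Ant environment
Q = np.zeros((n_states, n_actions)) #Q-matrix: one value per state/action pair
alpha, gamma, eps = 0.1, 0.99, 1.0 #learning rate, discount factor, exploration rate

def choose_action(state):
    ## exploration-exploitation trade-off: random action with probability eps, otherwise the best known one
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    ## Bellman update: move Q(s,a) toward the reward plus the discounted best future value
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])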

Please note that the Q-matrix can be multidimensional and much more complicated. Think, for instance, of a trading algorithm, where the state would combine many market variables and the actions would be buying, selling, or holding.

In 2013, there was a breakthrough in the field of Reinforcement Learning when DeepMind (acquired by Google shortly after) introduced the Deep Q-Network (DQN), designed to learn to play Atari games from raw pixels by combining the two concepts of Deep Learning and Q-Learning. To put it in simple terms, Deep Learning is used to approximate the Q-values instead of explicitly storing them in a table: a Neural Network is trained to predict the Q-value of each possible action, using the current state of the environment as input.
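
Conceptually, such a network can be sketched in a few lines of PyTorch. The layer sizes below are arbitrary, and the input/output dimensions are only illustrative (27 matches the default Ant observation, 5 matches the discrete wrapper built below):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    ## minimal sketch: map the current state to one Q-value per possible action
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions)) #one output (Q-value) per action

    def forward(self, state):
        return self.net(state)

## picking an action = taking the argmax over the predicted Q-values
q_net = QNetwork(obs_dim=27, n_actions=5)
best_action = q_net(torch.randn(1, 27)).argmax(dim=1)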

The Q-Learning family was mainly designed for discrete environments, so it doesn’t really work on the robot Ant out of the box. An alternative solution is to discretize the environment (even if it’s not the most efficient way to approach a continuous problem). We just have to create a wrapper for the Python class that expects a discrete action (e.g. “move forward”) and consequently applies force to the joints based on that command.

class DiscreteEnvWrapper(gym.Env):
    
    def __init__(self, render_mode=None): 
        super().__init__() 
        self.env = gym.make("CustomAntEnv-v1", render_mode=render_mode) 
        self.action_space = gym.spaces.Discrete(5)  #will have 5 actions 
        self.observation_space = self.env.observation_space #same observation space
        n_joints = self.env.action_space.shape[0]         
        self.action_map = [
            ## action 0 = stand still 
            np.zeros(n_joints),
            ## action 1 = push all forward
            0.5*np.ones(n_joints),
            ## action 2 = push all backward
           -0.5*np.ones(n_joints),
            ## action 3 = front legs forward + back legs backward 
            0.5*np.concatenate([np.ones(n_joints//2), -np.ones(n_joints//2)]),
            ## action 4 = front legs backward + back legs forward 
            0.5*np.concatenate([-np.ones(n_joints//2), np.ones(n_joints//2)])
        ] 
        
    def step(self, discrete_action): 
        assert self.action_space.contains(discrete_action) 
        continuous_action = self.action_map[discrete_action] 
        obs, reward, terminated, truncated, info = self.env.step(continuous_action) 
        return obs, reward, terminated, truncated, info
        
    def reset(self, **kwargs): 
        obs, info = self.env.reset(**kwargs) 
        return obs, info 
    
    def render(self): 
        return self.env.render() 
    
    def close(self): 
        self.env.close()

## test
env = DiscreteEnvWrapper()
obs, info = env.reset()

print(f"--- INFO: {len(info)} ---")
print(info, "\n")

print(f"--- OBS: {obs.shape} ---")
print(obs, "\n")

print(f"--- ACTIONS: {env.action_space} ---")
discrete_action = env.action_space.sample()
continuous_action = env.action_map[discrete_action] 
print("discrete:", discrete_action, "-> continuous:", continuous_action, "\n")

print(f"--- REWARD ---")
obs, reward, terminated, truncated, info = env.step( discrete_action )
print(reward, "\n")

Now this environment, with just 5 possible actions, will work with DQN. In Python, the easiest way to use Deep RL algorithms is through Stable-Baselines3 (pip install stable-baselines3), a collection of the most famous models, already implemented and ready to go, all written in PyTorch (pip install torch). Additionally, I find it very useful to watch the training progress on TensorBoard (pip install tensorboard). I created a folder named “logs”, and I can just run tensorboard --logdir=logs/ in the terminal to serve the dashboard locally (http://localhost:6006/).

import stable_baselines3 as sb
from stable_baselines3.common.vec_env import DummyVecEnv

# TRAIN
env = DiscreteEnvWrapper(render_mode=None) #no rendering to speed up
env = DummyVecEnv([lambda:env]) 
model_name = "ant_dqn"

print("Training START")
model = sb.DQN(policy="MlpPolicy", env=env, verbose=0, learning_rate=0.005,
               exploration_fraction=0.2, exploration_final_eps=0.05, #eps decays linearly from 1 to 0.05
               tensorboard_log="logs/") #>tensorboard --logdir=logs/
model.learn(total_timesteps=1_000_000, #20min
            tb_log_name=model_name, log_interval=10)
print("Training DONE")

model.save(model_name)

After the training is complete, we can load the new model and test it in the rendered environment. Now, the agent won’t be updating the preferred actions anymore. Instead, it will use the trained model to predict the next best action given the current state.

# TEST
env = DiscreteEnvWrapper(render_mode="human")
model = sb.DQN.load(path=model_name, env=env)
obs, info = env.reset()

reset = False #set to True to reset the environment when the episode ends
episode = 1
total_reward, step = 0, 0

for _ in range(1000):
    ## action
    step += 1
    action, _ = model.predict(obs)    
    obs, reward, terminated, truncated, info = env.step(action) 
    ## reward
    total_reward += reward
    ## render
    env.render() 
    time.sleep(1/240)
    if (step == 1) or (step % 100 == 0): #print first step and every 100 steps
        print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
    ## reset
    if reset:
        if terminated or truncated: #print the last step
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
            obs, info = env.reset()
            episode += 1
            total_reward, step = 0, 0
            print("------------------------------------------")

env.close()

As you can see, the robot learned that the best policy is to jump, but the movements aren’t fluid because we didn’t use a model designed for continuous actions.

Actor Critic

In practice, Actor-Critic algorithms are the most widely used, as they are well suited for continuous environments. The basic idea is to have two systems working together: a policy function (the “Actor”) that selects actions, and a value function (the “Critic”) that estimates the expected reward. The model learns to adjust its decision making by comparing the actual rewards it receives with the Critic’s predictions.

One of the first stable deep Actor-Critic algorithms was Advantage Actor-Critic (A2C), a synchronous variant, popularized by OpenAI, of DeepMind’s A3C (2016). It aims to minimize the loss between the actual reward received after the Actor takes an action and the return estimated by the Critic. The Neural Network has layers shared by the Actor and the Critic, but two separate heads: the Actor’s policy output (which action to take) and the Critic’s value estimate (the expected reward), the latter being the addition with respect to DQN.
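
As a rough sketch of that shared-body architecture (the sizes are illustrative, and real implementations such as those in Stable-Baselines3 are more elaborate, e.g. they also output the spread of the action distribution):

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    ## minimal sketch of a shared-body Actor-Critic network for continuous actions
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU()) #layers shared by both heads
        self.actor = nn.Linear(64, act_dim) #Actor head: the policy (which action to take)
        self.critic = nn.Linear(64, 1) #Critic head: estimated value of the current state

    def forward(self, state):
        x = self.shared(state)
        return torch.tanh(self.actor(x)), self.critic(x) #action in [-1,1] + value estimate

ac = ActorCritic(obs_dim=27, act_dim=8) #illustrative: Ant has 27 observations and 8 joint torques
action, value = ac(torch.randn(1, 27))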

Over the years, AC algorithms have improved with more stable and efficient variants, like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). The latter uses not one but two Critic networks to get a “second opinion”. Remember that we can use these models directly in the continuous environment.

# TRAIN
env_name, model_name = "CustomAntEnv-v1", "ant_sac"
env = gym.make(env_name) #no rendering to speed up
env = DummyVecEnv([lambda:env])

print("Training START")
model = sb.SAC(policy="MlpPolicy", env=env, verbose=0, learning_rate=0.005, 
                ent_coef=0.005, #exploration
                tensorboard_log="logs/") #>tensorboard --logdir=logs/
model.learn(total_timesteps=100_000, #3h
            tb_log_name=model_name, log_interval=10)
print("Training DONE")

## save
model.save(model_name)

The training of the SAC requires more time, but the results are much better.

# TEST
env = gym.make(env_name, render_mode="human")
model = sb.SAC.load(path=model_name, env=env)
obs, info = env.reset()

reset = False #set to True to reset the environment when the episode ends
episode = 1
total_reward, step = 0, 0

for _ in range(1000):
    ## action
    step += 1
    action, _ = model.predict(obs)    
    obs, reward, terminated, truncated, info = env.step(action) 
    ## reward
    total_reward += reward
    ## render
    env.render() 
    time.sleep(1/240)
    if (step == 1) or (step % 100 == 0): #print first step and every 100 steps
        print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
    ## reset
    if reset:
        if terminated or truncated: #print the last step
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
            obs, info = env.reset()
            episode += 1
            total_reward, step = 0, 0
            print("------------------------------------------")

env.close()

Given the popularity of Q-Learning and Actor-Critic, there have been hybrid adaptations combining the two approaches, which also extend DQN-style learning to continuous action spaces: for example, Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3). But beware that the more complex the model, the harder the training.
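
Both are available in Stable-Baselines3 and plug into the same workflow used above, reusing the same imports and the custom env. As a quick sketch (the hyperparameters here are placeholders, not tuned values):

# TRAIN (TD3 instead of SAC, same continuous custom env)
env = DummyVecEnv([lambda: gym.make("CustomAntEnv-v1")])
model = sb.TD3(policy="MlpPolicy", env=env, verbose=0, learning_rate=0.001,
               tensorboard_log="logs/") #>tensorboard --logdir=logs/
model.learn(total_timesteps=100_000, tb_log_name="ant_td3", log_interval=10)
model.save("ant_td3")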

Experimental Models

Besides the main families (Q and AC), you can find other models that are less used in practice, but no less interesting. In particular, they can be powerful alternatives for tasks where rewards are sparse and hard to design. For example:

  • Evolutionary Algorithms evolve the policies through mutation and selection instead of a gradient. Inspired by Darwin’s evolution, they are robust but computationally heavy.
  • Imitation Learning skips exploration and trains agents to mimic expert demonstrations. It’s based on the concept of “behavioral cloning”, blending supervised learning with RL ideas.

For experimental purposes, let’s try the first one with EvoTorch, an open-source toolkit for neuroevolution. I’m choosing this because it works well with PyTorch and Gym (pip install evotorch).

One of the most popular Evolutionary Algorithms for RL is Policy Gradients with Parameter-based Exploration (PGPE). Essentially, it doesn’t train one Neural Network directly; instead, it builds a probability distribution (a Gaussian) over all possible weights (μ = the average set of weights, σ = the exploration around that center). In every generation, PGPE samples a population of candidate policies from this distribution, starting from a random one, and then adjusts the mean and variance based on the rewards obtained (the evolution of the population). PGPE is considered Parallelized RL because, unlike classic methods like Q-Learning and Actor-Critic, which update one policy using batches of samples, PGPE evaluates many policy variations in parallel.
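
As a conceptual sketch of this idea (heavily simplified: only the mean μ is updated here, with a made-up fitness function, whereas full PGPE also adapts σ and uses symmetric sampling):

import numpy as np

n_weights, popsize, lr = 10, 20, 0.05
mu = np.zeros(n_weights) #center of the distribution (average set of weights)
sigma = 0.1 * np.ones(n_weights) #exploration around the center

def evaluate(weights):
    ## placeholder fitness: in practice, run one episode with these weights and return the total reward
    return -np.sum((weights - 1.0)**2)

for generation in range(100):
    population = mu + sigma * np.random.randn(popsize, n_weights) #sample candidate policies
    rewards = np.array([evaluate(w) for w in population])
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8) #rank candidates by reward
    mu = mu + lr * advantage @ (population - mu) / popsize #shift the center toward better policies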

Before running the training, we have to define the “problem”, which is the task to optimize (basically our environment).

from evotorch.neuroevolution import GymNE
from evotorch.algorithms import PGPE
from evotorch.logging import StdOutLogger

## problem
train = GymNE(env=CustomAntEnv, #directly the class because it's custom env
              env_config={"render_mode":None}, #no rendering to speed up
              network="Linear(obs_length, act_length)", #linear policy
              observation_normalization=True,
              decrease_rewards_by=1, #normalization trick to stabilize evolution
              episode_length=200, #steps per episode
              num_actors="max") #use all available CPU cores

## model
model = PGPE(problem=train, popsize=20, stdev_init=0.1, #keep it small
             center_learning_rate=0.005, stdev_learning_rate=0.1,
             optimizer_config={"max_speed":0.015})

## train
StdOutLogger(searcher=model, interval=20)
model.run(num_generations=100)

In order to test the model, we need another “problem” that renders the simulation. Then, we just extract the best-performing set of weights from the center of the distribution (during training, the Gaussian shifted toward better regions of the policy space).

## visualization problem
test = GymNE(env=CustomAntEnv, env_config={"render_mode":"human"},
             network="Linear(obs_length, act_length)",
             observation_normalization=True,
             decrease_rewards_by=1,
             num_actors=1) #only need 1 for visualization

## test best policy
population_center = model.status["center"]
policy = test.to_policy(population_center)

## render
test.visualize(policy)

Conclusion

This article has been a tutorial on how to use Reinforcement Learning for Robotics. I showed how to build 3D simulations with Gym and MuJoCo, how to customize an environment, and which RL algorithms are better suited for different use cases. New tutorials with more advanced robots will come.

Full code for this article: GitHub

I hope you enjoyed it! Feel free to contact me for questions and feedback or just to share your interesting projects.



