
DIY AI & ML: Solving The Multi-Armed Bandit Problem with Thompson Sampling

Introduction

We live in the age of data-driven decision-making. Not only do most organizations maintain massive databases of information, but they also have countless teams that rely on this data to inform their decisions. From clickstream traffic to wearable edge devices, telemetry, and much more, the speed and scale of data-driven decision-making are increasing exponentially, driving the popularity of machine learning and AI frameworks.

Speaking of data-driven decision-making frameworks, one of the most reliable and time-tested approaches is A/B testing. A/B testing is especially popular among websites, digital products, and similar outlets where customer feedback in the form of clicks, orders, etc., is received nearly instantly and at scale. What makes A/B testing such a powerful decision framework is the ability to control for countless variables so that a stakeholder can see the effect the element they are introducing in the test has on a key performance indicator (KPI).

Like all things, there are drawbacks to A/B testing, particularly the time it can take. Following the conclusion of a test, someone must communicate the results, and stakeholders must use the appropriate channels to reach a decision and implement it. All that lost time can translate into an opportunity cost, assuming the test experience demonstrated an impact. What if there were a framework or an algorithm that could systematically automate this process? This is where Thompson Sampling comes into play.

The Multi-Armed Bandit Problem

Imagine you go to the casino for the first time and, standing before you, are three slot machines: Machine A, Machine B, and Machine C. You have no idea which machine has the highest payout; however, you come up with a clever idea. For the first few pulls, assuming you don’t run out of luck, you pull the slot machine arms at random. After each pull, you record the result. After a few iterations, you take a look at your results, and you take a look at the win rate for each machine:

  • Machine A: 40%
  • Machine B: 30%
  • Machine C: 50%

At this point, you decide to pull Machine C at a slightly higher rate than the other two, as you believe there is more evidence that Machine C has the highest win rate, yet you want to collect more data to be sure. After the next few iterations, you take a look at the new results:

  • Machine A: 45%
  • Machine B: 25%
  • Machine C: 60%

Now, you have a lot more confidence that Machine C has the highest win rate. This hypothetical example is what gave the Multi-Armed Bandit Problem its name and is a classic example of how Thompson Sampling is utilized.

This Bayesian algorithm is designed to choose between multiple options with unknown reward distributions and maximize the expected reward. It accomplishes this by balancing the exploration-exploitation tradeoff. Since the reward distributions are unknown, the algorithm initially chooses options at random, collects data on the results, and, over time, progressively favors the options that yield a higher average reward.
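To make the casino story concrete, here is a minimal, self-contained sketch of Thompson Sampling on the three slot machines. The win rates below are illustrative assumptions, not part of the article's simulation.

```python
import numpy as np

# Assumed true win rates for Machines A, B, and C (illustration only)
true_win_rates = [0.40, 0.30, 0.50]
rng = np.random.default_rng(0)

# One Beta(alpha, beta) posterior per machine, starting flat at Beta(1, 1)
alpha = np.ones(3)
beta = np.ones(3)

for _ in range(5_000):
    samples = rng.beta(alpha, beta)        # one draw from each machine's posterior
    arm = int(np.argmax(samples))          # pull the machine with the highest draw
    win = rng.binomial(n=1, p=true_win_rates[arm])
    alpha[arm] += win                      # win  -> increment alpha
    beta[arm] += 1 - win                   # loss -> increment beta

pulls = alpha + beta - 2                   # pulls per machine (priors subtracted)
print(int(np.argmax(pulls)))               # Machine C (index 2) dominates the pulls
```

After a few thousand rounds, almost all pulls concentrate on Machine C, the machine with the highest true win rate.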

In this article, I will walk you through how you can build your own Thompson Sampling Algorithm object in Python and apply it to a hypothetical yet real-life example.

Email Headlines — Optimizing the Open Rate


In this example, assume the role of someone in a marketing organization charged with email campaigns. In the past, the team tested which headlines led to higher email open rates using an A/B testing framework. This time, however, you recommend implementing a multi-armed bandit approach to start realizing value faster.

To demonstrate the effectiveness of a Thompson Sampling (also known as the bandit) approach, I will build a Python simulation that compares it to a random approach. Let’s get started.

Step 1 – Base Email Simulation

This will be the main object for this project; it serves as a base template for both the random and bandit simulations. The initialization function stores the basic information needed to execute the email simulation, specifically the headline of each email and its true open rate. One item I want to stress is the true open rates: they will be "unknown" to the actual simulation and are only used as probabilities when an email is sent. A random number generator object is also created so a simulation can be replicated, which can be useful. Finally, we call a built-in function, reset_results(), that I will discuss next.

import numpy as np
import pandas as pd

class BaseEmailSimulation:
    """
    Base class for email headline simulations.

    Shared responsibilities:
    - store headlines and their true open probabilities
    - simulate a binary email-open outcome
    - reset simulation state
    - build a summary table from the latest run
    """

    def __init__(self, headlines, true_probabilities, random_state=None):
        self.headlines = list(headlines)
        self.true_probabilities = np.array(true_probabilities, dtype=float)

        if len(self.headlines) == 0:
            raise ValueError("At least one headline must be provided.")

        if len(self.headlines) != len(self.true_probabilities):
            raise ValueError("headlines and true_probabilities must have the same length.")

        if np.any(self.true_probabilities < 0) or np.any(self.true_probabilities > 1):
            raise ValueError("All true_probabilities must be between 0 and 1.")

        self.n_arms = len(self.headlines)
        self.rng = np.random.default_rng(random_state)

        # Ground-truth best arm information for evaluation
        self.best_arm_index = int(np.argmax(self.true_probabilities))
        self.best_headline = self.headlines[self.best_arm_index]
        self.best_true_probability = float(self.true_probabilities[self.best_arm_index])

        # Results from the latest completed simulation
        self.reset_results()

reset_results()

For each simulation, it is useful to record several details, including:

  • Which headline was selected at each step
  • Whether or not each email sent resulted in an open
  • Overall opens and the open rate

These attributes are assigned inside reset_results() rather than directly in __init__, so each simulation run starts with a fresh history. This is especially important for the bandit subclass, which I will show you later in the article.

def reset_results(self):
    """
    Clear all results from the latest simulation.
    Called automatically at initialization and at the start of each run().
    """
    self.reward_history = []
    self.selection_history = []
    self.history = pd.DataFrame()
    self.summary_table = pd.DataFrame()
    self.total_opens = 0
    self.cumulative_opens = []

send_email()

The next function that needs to be featured is how the email sends are executed. Given an arm index (headline index), the function draws exactly one value from a binomial distribution with a single independent trial (a Bernoulli trial), using the true open rate for that headline. This is a practical approach, as sending an email has exactly two outcomes: it is opened or ignored, represented by 1 and 0, respectively. The binomial function from numpy does just that, with the chance of returning "1" equal to the true open rate of the respective email headline.

def send_email(self, arm_index):
    """
    Simulate sending an email with the selected headline.

    Returns
    -------
    int
        1 if opened, 0 otherwise.
    """
    if arm_index < 0 or arm_index >= self.n_arms:
        raise IndexError("arm_index is out of bounds.")

    true_p = self.true_probabilities[arm_index]
    reward = self.rng.binomial(n=1, p=true_p)

    return int(reward)
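As a standalone sketch of that sampling step (the 0.24 open rate below is a made-up value for illustration), a binomial draw with n=1 only ever returns 0 or 1, and the long-run average of many draws tracks the true probability:

```python
import numpy as np

rng = np.random.default_rng(7)
true_p = 0.24  # hypothetical true open rate

# 50,000 simulated sends, each a single Bernoulli trial
sends = rng.binomial(n=1, p=true_p, size=50_000)

print(sorted(set(sends.tolist())))  # [0, 1] — the only possible outcomes
print(round(sends.mean(), 3))       # ≈ 0.24, the true open rate
```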

_finalize_history() & build_summary_table()

Lastly, these two functions work in conjunction by taking the results of a simulation and building a clean summary table showing, for each headline, metrics such as the number of times it was selected, the number of opens, the true open rate, and the realized open rate.

def _finalize_history(self, records):
    """
    Convert round-level records into a DataFrame and populate
    shared result attributes.
    """
    self.history = pd.DataFrame(records)

    if not self.history.empty:
        self.reward_history = self.history["reward"].tolist()
        self.selection_history = self.history["arm_index"].tolist()
        self.total_opens = int(self.history["reward"].sum())
        self.cumulative_opens = self.history["reward"].cumsum().tolist()
    else:
        self.reward_history = []
        self.selection_history = []
        self.total_opens = 0
        self.cumulative_opens = []

    self.summary_table = self.build_summary_table()

def build_summary_table(self):
    """
    Build a summary table from the latest completed simulation.

    Returns
    -------
    pd.DataFrame
        Summary by headline.
    """
    if self.history.empty:
        return pd.DataFrame(columns=[
            "arm_index",
            "headline",
            "selections",
            "opens",
            "realized_open_rate",
            "true_open_rate"
        ])

    summary = (
        self.history
        .groupby(["arm_index", "headline"], as_index=False)
        .agg(
            selections=("reward", "size"),
            opens=("reward", "sum"),
            realized_open_rate=("reward", "mean"),
            true_open_rate=("true_open_rate", "first")
        )
        .sort_values("arm_index")
        .reset_index(drop=True)
    )

    return summary
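The aggregation pattern inside build_summary_table() can be previewed on a toy set of round-level records (the values below are made up purely for illustration):

```python
import pandas as pd

# Toy records mimicking the structure produced by run()
records = [
    {"arm_index": 0, "headline": "A", "reward": 1, "true_open_rate": 0.20},
    {"arm_index": 0, "headline": "A", "reward": 0, "true_open_rate": 0.20},
    {"arm_index": 1, "headline": "B", "reward": 1, "true_open_rate": 0.25},
]
history = pd.DataFrame(records)

# Same named-aggregation pattern as build_summary_table()
summary = (
    history
    .groupby(["arm_index", "headline"], as_index=False)
    .agg(
        selections=("reward", "size"),
        opens=("reward", "sum"),
        realized_open_rate=("reward", "mean"),
        true_open_rate=("true_open_rate", "first"),
    )
)
print(summary)
```

Headline "A" was selected twice with one open, so its realized open rate is 0.5 against a true rate of 0.20, a reminder that small samples can be noisy.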

Step 2 – Subclass: Random Email Simulation

In order to properly gauge the impact of a multi-armed bandit approach for email headlines, we need to compare it against a benchmark, in this case a randomized approach, which also mirrors how an A/B test allocates traffic. The methods below belong to a RandomSimulation subclass of BaseEmailSimulation.

select_headline()

This is the core of the RandomSimulation class: select_headline() chooses an integer arm index uniformly at random between 0 (inclusive) and the number of headlines, or arms (exclusive).

def select_headline(self):
    """
    Select one headline uniformly at random.
    """
    return int(self.rng.integers(low=0, high=self.n_arms))

run()

This is how the simulation is executed. All that is needed is the number of iterations from the end user. It leverages the select_headline() function in tandem with the send_email() function from the parent class. At each round, an email is sent, and results are recorded.

def run(self, num_iterations):
    """
    Run a fresh random simulation from scratch.

    Parameters
    ----------
    num_iterations : int
        Number of simulated email sends.
    """
    if num_iterations <= 0:
        raise ValueError("num_iterations must be greater than 0.")

    self.reset_results()
    records = []
    cumulative_opens = 0

    for round_number in range(1, num_iterations + 1):
        arm_index = self.select_headline()
        reward = self.send_email(arm_index)
        cumulative_opens += reward

        records.append({
            "round": round_number,
            "arm_index": arm_index,
            "headline": self.headlines[arm_index],
            "reward": reward,
            "true_open_rate": self.true_probabilities[arm_index],
            "cumulative_opens": cumulative_opens
        })

    self._finalize_history(records)
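As a quick standalone sanity check (a sketch independent of the classes above), a uniformly random policy's long-run open rate should approach the simple average of the true open rates. Using the five true rates from later in this article:

```python
import numpy as np

true_rates = np.array([0.18, 0.21, 0.16, 0.24, 0.20])
rng = np.random.default_rng(42)

n = 100_000
arms = rng.integers(0, len(true_rates), size=n)   # uniform headline choice
rewards = rng.binomial(n=1, p=true_rates[arms])   # one Bernoulli send per round

print(round(rewards.mean(), 3))  # ≈ 0.198, the mean of the true rates
```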

Thompson Sampling & Beta Distributions

Before diving into our bandit subclass, it is essential to cover the mathematics behind Thompson Sampling in more detail. I will cover this via our hypothetical email example in this article.

Let’s first consider what we know so far about our current situation. There is a set of email headlines, each with an associated but unknown open rate. We need a framework to decide which email headline to send to a customer. Before going further, let’s define some variables:

  • Headlines:
    • 1: “Your Exclusive Spring Offer Is Here”
    • 2: “48 Hours Only: Save 25%”
    • 3: “Don’t Miss Your Member Discount”
    • 4: “Ending Tonight: Final Chance to Save”
    • 5: “A Little Something Just for You”
  • A_i = The headline (arm) at index i
  • t_i = The current iteration (email send) to be performed
  • r_i = The reward observed at time t_i: either opened (1) or ignored (0)

We have yet to send the first email. Which headline should we select? This is where the Beta Distribution comes into play. A Beta Distribution is a continuous probability distribution defined on the interval [0, 1]. It has two key parameters, alpha and beta, representing successes and failures, respectively. At time t = 1, all headlines start with alpha = 1 and beta = 1. An email open adds 1 to alpha; otherwise, beta is incremented by 1.

At first glance, you might think the algorithm is assuming a 50% true open rate at the start. This is not necessarily the case, and this assumption would completely neglect the whole point of the Thompson Sampling approach: the exploration-exploitation tradeoff. The alpha and beta variables are used to build a Beta Distribution for each individual headline. Prior to the first iteration, these distributions will look something like this:

Image provided by the author

I promise there is more to it than just a horizontal line. The x-axis represents probabilities from 0 to 1, and the y-axis represents the probability density at each value. Using these distributions, we sample one random value from each headline's distribution, then send the email whose sampled value is highest. On this first iteration, the decision is purely random. Why? Because every probability has the same density. But what about after a few more iterations? Remember, each reward increments either alpha or beta in the respective Beta distribution. Let's see what the distribution looks like with alpha = 10 and beta = 10.

Image provided by the author

There certainly is a difference, but what does that mean in the context of our problem? First of all, if alpha and beta both equal 10, it means we selected that headline 18 times and observed 9 successes (email opens) and 9 failures (email ignored); remember, we always start with alpha and beta equal to 1. Thus, the realized open rate for this headline is 9/18 = 0.5, or 50%. If we randomly sample a value from this distribution, what do you think it will be? Most likely something close to 0.5, but it is not guaranteed. Let's look at one more example and set alpha and beta equal to 100.

Image provided by the author

Now there is a much higher chance that a randomly sampled value will be somewhere around 0.5. This progression demonstrates how Thompson Sampling seamlessly moves from exploration to exploitation. Let’s see how we can build an object that executes this framework.
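Before building that object, the progression in the plots above can be checked numerically: as alpha and beta grow together, samples from Beta(a, a) stay centered on 0.5 but spread out less and less.

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical mean and spread of Beta(a, a) for growing, equal alpha and beta
stds = {}
for a in (1, 10, 100):
    samples = rng.beta(a, a, size=100_000)
    stds[a] = samples.std()
    print(a, round(samples.mean(), 3), round(stds[a], 3))
```

All three distributions have a mean near 0.5, but the standard deviation shrinks from roughly 0.29 at Beta(1, 1) to roughly 0.04 at Beta(100, 100), which is exactly the shift from exploration toward exploitation.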

Step 3 – Subclass: Bandit Email Simulation

Let’s take a look at some key attributes, starting with alpha_prior and beta_prior. They default to 1 each time a BanditSimulation object is initialized. "Prior" is a key term in this context: at each iteration, our decision about which headline to send depends on a probability distribution known as the posterior, which is the prior updated with the observed data. Next, this object inherits its shared attributes from the BaseEmailSimulation parent class. Finally, a custom function called reset_bandit_state() is called. Let's discuss that function next.

class BanditSimulation(BaseEmailSimulation):
    """
    Thompson Sampling email headline simulation.

    Each headline is modeled with a Beta posterior over its
    unknown open probability. At each iteration, one sample is drawn
    from each posterior, and the headline with the largest sample is selected.
    """

    def __init__(
        self,
        headlines,
        true_probabilities,
        alpha_prior=1.0,
        beta_prior=1.0,
        random_state=None
    ):
        super().__init__(
            headlines=headlines,
            true_probabilities=true_probabilities,
            random_state=random_state
        )

        if alpha_prior <= 0 or beta_prior <= 0:
            raise ValueError("alpha_prior and beta_prior must be positive.")

        self.alpha_prior = float(alpha_prior)
        self.beta_prior = float(beta_prior)

        self.reset_bandit_state()

reset_bandit_state()

The objects I have built for this article are intended to run in a simulation; therefore, we need to include failsafes to prevent data leakage between simulations. The reset_bandit_state() function accomplishes this by resetting the posterior for each headline whenever a run starts or a new bandit object is initialized. Otherwise, we would risk running a simulation as if the data had already been gathered, which defeats the whole purpose of a Thompson Sampling approach.

def reset_bandit_state(self):
    """
    Reset posterior state for a fresh Thompson Sampling run.
    """
    self.alpha = np.full(self.n_arms, self.alpha_prior, dtype=float)
    self.beta = np.full(self.n_arms, self.beta_prior, dtype=float)

Selection & Reward Functions

Starting with posterior_means(): this function returns the posterior mean open-rate estimate, alpha / (alpha + beta), for each headline. The next function, select_headline(), samples one random value from each headline's posterior and returns the index of the largest sample. Finally, we have update_posterior(), which increments alpha or beta for the selected headline based on the reward.

def posterior_means(self):
    """
    Return the posterior mean for each headline.
    """
    return self.alpha / (self.alpha + self.beta)

def select_headline(self):
    """
    Draw one sample from each arm's Beta posterior and
    select the headline with the highest sampled value.
    """
    sampled_values = self.rng.beta(self.alpha, self.beta)
    return int(np.argmax(sampled_values))

def update_posterior(self, arm_index, reward):
    """
    Update the selected arm's Beta posterior using the observed reward.
    """
    if arm_index < 0 or arm_index >= self.n_arms:
        raise IndexError("arm_index is out of bounds.")

    if reward not in (0, 1):
        raise ValueError("reward must be either 0 or 1.")

    self.alpha[arm_index] += reward
    self.beta[arm_index] += (1 - reward)
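To see the update rule in isolation, here is a hypothetical mini-example that replays the alpha = beta = 10 scenario discussed earlier: starting from a Beta(1, 1) prior, 9 opens and 9 ignores land the posterior mean on exactly 0.5.

```python
# One headline, Beta(1, 1) prior; the reward sequence is made up for illustration
alpha, beta = 1.0, 1.0
rewards = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # 9 opens, 9 ignores

for r in rewards:
    alpha += r        # open  -> increment alpha
    beta += 1 - r     # ignore -> increment beta

posterior_mean = alpha / (alpha + beta)
print(alpha, beta, posterior_mean)  # 10.0 10.0 0.5
```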

run() and build_summary_table()

Everything is in place to execute a Thompson Sampling-driven simulation. Note that we call reset_results() and reset_bandit_state() to ensure a fresh run that does not rely on previous information. At the end of each simulation, results are aggregated and summarized via the overridden build_summary_table() function.

def run(self, num_iterations):
    """
    Run a fresh Thompson Sampling simulation from scratch.

    Parameters
    ----------
    num_iterations : int
        Number of simulated email sends.
    """
    if num_iterations <= 0:
        raise ValueError("num_iterations must be greater than 0.")

    self.reset_results()
    self.reset_bandit_state()

    records = []
    cumulative_opens = 0

    for round_number in range(1, num_iterations + 1):
        arm_index = self.select_headline()
        reward = self.send_email(arm_index)
        self.update_posterior(arm_index, reward)

        cumulative_opens += reward

        records.append({
            "round": round_number,
            "arm_index": arm_index,
            "headline": self.headlines[arm_index],
            "reward": reward,
            "true_open_rate": self.true_probabilities[arm_index],
            "cumulative_opens": cumulative_opens,
            "posterior_mean": self.posterior_means()[arm_index],
            "alpha": self.alpha[arm_index],
            "beta": self.beta[arm_index]
        })

    self._finalize_history(records)

    # Rebuild summary table with extra posterior columns
    self.summary_table = self.build_summary_table()

def build_summary_table(self):
    """
    Build a summary table for the latest Thompson Sampling run.
    """
    if self.history.empty:
        return pd.DataFrame(columns=[
            "arm_index",
            "headline",
            "selections",
            "opens",
            "realized_open_rate",
            "true_open_rate",
            "final_posterior_mean",
            "final_alpha",
            "final_beta"
        ])

    summary = (
        self.history
        .groupby(["arm_index", "headline"], as_index=False)
        .agg(
            selections=("reward", "size"),
            opens=("reward", "sum"),
            realized_open_rate=("reward", "mean"),
            true_open_rate=("true_open_rate", "first")
        )
        .sort_values("arm_index")
        .reset_index(drop=True)
    )

    summary["final_posterior_mean"] = self.posterior_means()
    summary["final_alpha"] = self.alpha
    summary["final_beta"] = self.beta

    return summary

Running the Simulation


One final step before running the simulation: take a look at a custom function I built specifically for this step. It runs several simulations over a list of iteration counts and outputs a detailed summary directly comparing the random and bandit approaches, showing key metrics such as the additional email opens from the bandit, the overall open rates, and the lift of the bandit open rate over the random open rate.

def run_comparison_experiment(
    headlines,
    true_probabilities,
    iteration_list=(100, 1000, 10000, 100000, 1000000),
    random_seed=42,
    bandit_seed=123,
    alpha_prior=1.0,
    beta_prior=1.0
):
    """
    Run RandomSimulation and BanditSimulation side by side across
    multiple iteration counts.

    Returns
    -------
    comparison_df : pd.DataFrame
        High-level comparison table across iteration counts.

    detailed_results : dict
        Nested dictionary containing simulation objects and summary tables
        for each iteration count.
    """

    comparison_rows = []
    detailed_results = {}

    for n in iteration_list:
        # Fresh objects for each simulation size
        random_sim = RandomSimulation(
            headlines=headlines,
            true_probabilities=true_probabilities,
            random_state=random_seed
        )

        bandit_sim = BanditSimulation(
            headlines=headlines,
            true_probabilities=true_probabilities,
            alpha_prior=alpha_prior,
            beta_prior=beta_prior,
            random_state=bandit_seed
        )

        # Run both simulations
        random_sim.run(num_iterations=n)
        bandit_sim.run(num_iterations=n)

        # Core metrics
        random_opens = random_sim.total_opens
        bandit_opens = bandit_sim.total_opens

        random_open_rate = random_opens / n
        bandit_open_rate = bandit_opens / n

        additional_opens = bandit_opens - random_opens

        opens_lift_pct = (
            ((bandit_opens - random_opens) / random_opens) * 100
            if random_opens != 0 else np.nan
        )

        open_rate_lift_pct = (
            ((bandit_open_rate - random_open_rate) / random_open_rate) * 100
            if random_open_rate != 0 else np.nan
        )

        comparison_rows.append({
            "iterations": n,
            "random_opens": random_opens,
            "bandit_opens": bandit_opens,
            "additional_opens_from_bandit": additional_opens,
            "opens_lift_pct": opens_lift_pct,
            "random_open_rate": random_open_rate,
            "bandit_open_rate": bandit_open_rate,
            "open_rate_lift_pct": open_rate_lift_pct
        })

        detailed_results[n] = {
            "random_sim": random_sim,
            "bandit_sim": bandit_sim,
            "random_summary_table": random_sim.summary_table.copy(),
            "bandit_summary_table": bandit_sim.summary_table.copy()
        }

    comparison_df = pd.DataFrame(comparison_rows)

    # Optional formatting helpers
    comparison_df["random_open_rate"] = comparison_df["random_open_rate"].round(4)
    comparison_df["bandit_open_rate"] = comparison_df["bandit_open_rate"].round(4)
    comparison_df["opens_lift_pct"] = comparison_df["opens_lift_pct"].round(2)
    comparison_df["open_rate_lift_pct"] = comparison_df["open_rate_lift_pct"].round(2)

    return comparison_df, detailed_results

Reviewing the Results

Here is the code for running both simulations and the comparison, along with a set of email headlines and their corresponding true open rates. Let's see how the bandit performed!

headlines = [
    "48 Hours Only: Save 25%",
    "Your Exclusive Spring Offer Is Here",
    "Don’t Miss Your Member Discount",
    "Ending Tonight: Final Chance to Save",
    "A Little Something Just for You"
]

true_open_rates = [0.18, 0.21, 0.16, 0.24, 0.20]

comparison_df, detailed_results = run_comparison_experiment(
    headlines=headlines,
    true_probabilities=true_open_rates,
    iteration_list=(100, 1000, 10000, 100000, 1000000),
    random_seed=42,
    bandit_seed=123
)

display_df = comparison_df.copy()
display_df["random_open_rate"] = (display_df["random_open_rate"] * 100).round(2).astype(str) + "%"
display_df["bandit_open_rate"] = (display_df["bandit_open_rate"] * 100).round(2).astype(str) + "%"
display_df["opens_lift_pct"] = display_df["opens_lift_pct"].round(2).astype(str) + "%"
display_df["open_rate_lift_pct"] = display_df["open_rate_lift_pct"].round(2).astype(str) + "%"

display_df
Image provided by the author

At 100 iterations, there is no real difference between the two approaches. At 1,000, it's a similar story, except this time the bandit approach lags slightly. Now look at what happens in the final three rows, at 10,000 iterations or more: the bandit approach consistently outperforms by roughly 20%! That number may not seem like much; however, imagine a large enterprise that can send millions of emails in a single campaign. That 20% could deliver millions of dollars in incremental revenue.
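That roughly 20% figure lines up with a back-of-the-envelope calculation: a random policy's expected open rate is the simple average of the true rates, while a converged bandit approaches the best arm's rate.

```python
import numpy as np

true_open_rates = np.array([0.18, 0.21, 0.16, 0.24, 0.20])

random_expected = true_open_rates.mean()   # 0.198, the random policy's ceiling
bandit_ceiling = true_open_rates.max()     # 0.24, the best arm's rate

lift_pct = (bandit_ceiling / random_expected - 1) * 100
print(round(lift_pct, 1))  # 21.2 — consistent with the observed ~20% lift
```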

My Final Thoughts

The Thompson Sampling approach can certainly be a powerful tool in the digital world, particularly as an online alternative to A/B testing for campaigns and recommendations. That being said, it works out much better in some scenarios than in others. To conclude, here is a quick checklist you can use to determine whether a Thompson Sampling approach could prove valuable:

  1. A single, clear KPI
    • The approach depends on a single outcome for rewards; whatever the underlying activity, its success metric must be a single, unambiguous outcome.
  2. A near-instant reward mechanism
    • Feedback needs to arrive somewhere between near-instantaneously and within a matter of minutes after the experience is shown to the customer or user. This allows the algorithm to receive feedback quickly, thereby optimizing faster.
  3. Bandwidth or budget for a large number of iterations
    • There is no magic number of email sends, page views, or impressions needed for an effective Thompson Sampling deployment; however, as the simulation results show, the bigger the better.
  4. Multiple, distinct arms
    • The arms (to borrow the bandit metaphor) need to be distinct, whatever the experience. Variations such as the email headlines should have high variability to ensure you are maximizing the exploration space. For example, if you are testing the color of a landing page, instead of testing different shades of a single color, consider testing completely different colors.

I hope you enjoyed this introduction to Thompson Sampling and the Multi-Armed Bandit Problem, along with the simulation! If you can find a suitable outlet for it, you may find it extremely valuable.


