Decision making under uncertainty is a central concern for product teams. Decisions large and small often have to be made under time pressure, despite incomplete — and potentially inaccurate — information about the problem and solution space. This may be due to a lack of relevant user research, limited knowledge about the intricacies of the business context (typically seen in companies that do too little to foster customer centricity and cross-team collaboration), and/or a flawed understanding of what a certain technology can and cannot do (particularly when building front-runner products with novel, untested technologies).
The situation is especially challenging for AI product teams for at least three reasons. First, many AI algorithms are inherently probabilistic in nature and thus yield uncertain outcomes (e.g., model predictions may be right or wrong with a certain probability). Second, a sufficient quantity of high-quality, relevant data may not always be available to properly train AI systems. Third, the recent explosion in hype around AI — and more specifically, generative AI — has led to unrealistic expectations among customers, Wall Street analysts and (inevitably) decision makers in upper management; the feeling among many of these stakeholders seems to be that virtually anything can now be solved easily with AI. Needless to say, it can be difficult for product teams to manage such expectations.
So, what hope is there for AI product teams? While there is no silver bullet, this article introduces readers to the notion of expected value and how it can be used to guide decision making in AI product management. After a brief overview of key theoretical concepts, we will look at three real-life case studies that underscore how expected value analysis can help AI product teams make strategic decisions under uncertainty across the product lifecycle. Given the foundational nature of the subject matter, the target audience of this article includes data scientists, AI product managers, engineers, UX researchers and designers, managers, and all others aspiring to develop great AI products.
Note: All figures and formulas in the following sections have been created by the author of this article.
Expected Value
Before looking at a formal definition of expected value, let us consider two simple games to build our intuition.
A Game of Dice
In the first game, imagine you are competing with your friends in a dice-rolling contest. Each of you gets to roll a fair, six-sided die N times. The score for each roll is given by the number of pips (dots) showing on the top face of the die after the roll; 1, 2, 3, 4, 5, and 6 are thus the only achievable scores for any given roll. The player with the highest total score at the end of N rolls wins the game. Assuming that N is a large number (say, 500), what should we expect to see at the conclusion of the game? Will there be an outright winner or a tie?
It turns out that, as N gets large, each player's average score per roll is likely to converge to 3.5, so the total scores will all be close to 3.5*N. For example, after 500 rolls, the total scores of you and your friends are likely to be around 3.5*500 = 1750. To see why, notice that, for a fair, six-sided die, the probability of any side being on top after a roll is 1/6. On average, the score of an individual roll will therefore be (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5, i.e., the average of all achievable scores per roll — this also happens to be the expected value of a die roll. Assuming that the outcomes of all rolls are independent of each other, we would expect the average score of the N rolls to be 3.5. So, after 500 rolls, we should not be surprised if each player has a total score of roughly 1750. In fact, the so-called strong law of large numbers in mathematics states that if you repeat an experiment (like rolling a die) a sufficiently large number of times, the average result of those experiments converges almost surely to the expected value.
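To make this convergence tangible, here is a minimal Python simulation (my own illustrative sketch, not part of the original game description) that rolls a fair die 500 times for each of three players and compares the resulting averages to the expected value of 3.5:

```python
# Minimal simulation of the dice game (illustrative sketch).
import random

def total_score(num_rolls: int) -> int:
    """Roll a fair six-sided die num_rolls times and return the total score."""
    return sum(random.randint(1, 6) for _ in range(num_rolls))

N = 500
for player in range(1, 4):
    total = total_score(N)
    print(f"Player {player}: total = {total}, average per roll = {total / N:.2f}")
# Totals typically land near 3.5 * 500 = 1750, with per-roll averages near 3.5.
```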
A Game of Roulette
Next, let us consider roulette, a popular game at casinos. Imagine you are playing a simplified version of roulette against a friend as follows. The roulette wheel has 38 pockets, and the game ends after N rounds. For each round, you must pick a whole number between 1 and 38, after which your friend will spin the roulette wheel and throw a small ball onto the spinning wheel. Once the wheel stops spinning, if the ball ends up in the pocket with the number that you picked, your friend will pay you $35; if the ball ends up in any of the other pockets, however, you must pay your friend $1. How much money do you expect you and your friend to make after N rounds?
You might think that, since $35 is a lot more than $1, your friend will end up paying you quite a bit of money by the time the game is done — but not so fast. Let us apply the same basic approach we used in the dice game to analyze this seemingly lucrative game of roulette. For any given round, the probability of the ball ending up in the pocket with the number that you picked is 1/38. The probability of the ball ending up in some other pocket is 37/38. From your perspective, the average outcome per round is therefore $35*1/38 – $1*37/38 = -$0.0526. So, it seems that you will actually end up owing your friend a little over a nickel after each round. After N rounds, you will be out of pocket by around $0.0526*N. If you play 500 rounds, as in the dice game above, you will end up paying your friend roughly $26. This is an example of a game that is rigged to favor the “house” (i.e., the casino, or in this case, your friend).
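As a quick sanity check, the same calculation can be written out in a few lines of Python (my own sketch; the payoff amounts simply mirror the game described above):

```python
# Expected value of the simplified roulette game, from the player's perspective.
WIN_PAYOFF = 35.0     # dollars received when the ball lands on your number
LOSS_PAYOFF = -1.0    # dollars paid when it lands in any other pocket
POCKETS = 38
N_ROUNDS = 500

p_win = 1 / POCKETS
expected_per_round = WIN_PAYOFF * p_win + LOSS_PAYOFF * (1 - p_win)
print(f"Expected value per round: ${expected_per_round:.4f}")                            # about -$0.0526
print(f"Expected total after {N_ROUNDS} rounds: ${expected_per_round * N_ROUNDS:.2f}")   # about -$26.32
```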
Formal Definition
Let X be a random variable that can yield any one of k outcome values, x1, x2, …, xk, each with probabilities p1, p2, …, pk of occurring, respectively. The expected value, E(X), of X is the sum of the outcome values weighted by their respective probabilities of occurrence:

E(X) = x1*p1 + x2*p2 + … + xk*pk
The total expected value of N independent occurrences of X will be N*E(X).
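The definition translates directly into code. Below is a minimal, general-purpose helper (my own sketch), applied to the two games discussed above:

```python
# General-purpose expected value helper (illustrative sketch).
def expected_value(outcomes: list[float], probabilities: list[float]) -> float:
    """Return E(X) = x1*p1 + x2*p2 + ... + xk*pk."""
    assert abs(sum(probabilities) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(x * p for x, p in zip(outcomes, probabilities))

# Die roll: six equally likely outcomes
print(expected_value([1, 2, 3, 4, 5, 6], [1/6] * 6))    # 3.5
# Roulette round (player's perspective): win $35 with probability 1/38, lose $1 otherwise
print(expected_value([35, -1], [1/38, 37/38]))          # about -0.0526
```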
The video below walks through some more hands-on examples of expected value calculations:
In the following case studies, we will see how expected value analysis can aid decision making under uncertainty. Fictitious company names are used throughout to preserve the anonymity of the businesses involved.
Case Study 1: Fraud Detection in E-Commerce
Cars Online is an online platform for reselling used cars across Europe. Legitimate car dealerships and private owners of used cars can list their vehicles for sale on Cars Online. A typical listing includes the seller’s asking price, facts about the car (e.g., its basic properties, special features, and details of any damage or wear and tear), and photos of the car’s interior and exterior. Buyers can browse the many listings on the platform and, having found one they like, click a button on the listing page to contact the seller, arrange a viewing, and ultimately make the purchase. Cars Online charges sellers a small monthly fee to show listings on the platform. To drive this subscription-based revenue, the process for sellers to sign up for the platform and create listings is kept as simple as possible.
The trouble is that some of the listings on the platform may in fact be fake. An unintended consequence of reducing the barriers to creating listings is that malicious users can set up fake seller accounts and create fake listings (often impersonating legitimate car dealerships) to lure and potentially defraud unsuspecting buyers. Fake listings can have a negative business impact on Cars Online in two ways. First, fearing reputational damage, affected sellers may take their listings to competing platforms, publicly criticize Cars Online for its apparently lax security standards (which might trigger other sellers to also leave the platform), and even sue for damages. Second, affected buyers (and those who hear about the instances of fraud in the press, on social media, and from friends and family) may also abandon the platform and write negative reviews online — all of which can further persuade sellers (the platform’s key revenue source) to leave.
Against this backdrop, the chief product officer (CPO) at Cars Online has tasked a product manager and a cross-functional team of customer success representatives, data scientists, and engineers with assessing the possibility of using AI to combat the scourge of fraudulent listings. The CPO is not interested in mere opinions — she wants a data-driven estimate of the net value of implementing an AI system that can help quickly detect and delete fraudulent listings from the platform before they can cause any damage.
Expected value analysis can be used to estimate the net value of the AI system by considering the probabilities of correct and incorrect predictions and their respective benefits and costs. In particular, we can distinguish between four cases: (1) correctly detected fake listings (true positives), (2) legitimate listings incorrectly deemed fake (false positives), (3) correctly detected legitimate listings (true negatives), and (4) fake listings incorrectly deemed legitimate (false negatives). The net monetary impact, C(i), of each case i can be estimated with the help of historical data and stakeholder interviews. Both true positives and false positives will result in some effort for Cars Online to remove the identified listings, but the false positives will result in additional costs (e.g., revenues lost due to removing legitimate listings and the cost of efforts to reinstate these). Meanwhile, whereas true negatives should incur no costs, false negatives can be expensive — these represent the very fraud that the CPO aims to combat.
Given an AI model with a certain predictive accuracy, if P(i) denotes the probability of each case i occurring in practice, then the sum S = C(1)*P(1) + C(2)*P(2) + C(3)*P(3) + C(4)*P(4) reflects the expected value of each prediction (see Figure 1 below). The total expected value for N predictions would then be N*S.
[Figure 1: Matrix of the four prediction outcome cases, combining the net monetary impact C(i) and probability P(i) of each case into the expected value S per prediction]
Based on the predictive performance profile of a given AI model (which determines the probabilities P(i)) and estimates of the net monetary impact C(i) of each of the four cases (from true positives to false negatives), the CPO can get a better sense of the expected value of building an AI system for fraud detection and make a go/no-go decision for the project accordingly. Of course, the additional fixed and variable costs usually associated with building, operating, and maintaining AI systems should also be factored into the overall decision making.
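To illustrate how such an estimate might be assembled, here is a small Python sketch. The monetary impacts and probabilities below are hypothetical placeholders of my own, not data from Cars Online; in practice, they would come from historical data, stakeholder interviews, and the predictive performance profile of the candidate model:

```python
# Hypothetical sketch: expected value per prediction for the fraud-detection use case.
# All monetary impacts and probabilities are made-up placeholders.

# Net monetary impact C(i) of each prediction outcome, in euros
impact = {
    "true_positive":  -5.0,    # effort to remove a correctly flagged fake listing
    "false_positive": -50.0,   # removal effort plus lost revenue and reinstatement work
    "true_negative":   0.0,    # legitimate listing left alone, no cost
    "false_negative": -500.0,  # undetected fraud: reputational and legal fallout
}

# Probability P(i) of each outcome, derived from the model's accuracy profile
# and the base rate of fake listings (again, hypothetical numbers)
probability = {
    "true_positive":  0.018,
    "false_positive": 0.010,
    "true_negative":  0.970,
    "false_negative": 0.002,
}

S = sum(impact[case] * probability[case] for case in impact)  # expected value per prediction
N = 100_000  # listings screened per month, say
print(f"Expected value per prediction: {S:.2f} EUR")
print(f"Expected total for {N} predictions: {S * N:,.0f} EUR")
# The go/no-go decision would compare this figure against the no-AI baseline and the
# fixed and variable costs of building and operating the system.
```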
This article considers a similar case study, in which a recruiting agency decides to implement an AI system for identifying and prioritizing good leads (candidates likely to be hired by clients) over bad ones. Readers are encouraged to go through that case study and reflect on the similarities and differences with the one discussed here.
Case Study 2: Auto-Completing Purchase Orders
The procurement department of ACME Auto, an American car manufacturer, creates a significant number of purchase orders every month. Building a single car requires several thousand individual parts that need to be procured on time and at the right quality standard from approved suppliers. A team of purchasing clerks is responsible for manually creating the purchase orders; this involves filling out an online form consisting of several data fields that define the precise specifications and quantities of each item to be purchased per order. Unsurprisingly, this is a time-consuming and error-prone activity, and as part of a company-wide cost-cutting initiative, the Chief Procurement Officer of ACME Auto has tasked a cross-functional product team within her department with substantially automating the creation of purchase orders using AI.
Having conducted user research in close collaboration with the purchasing clerks, the product team has decided to build an AI feature for auto-filling fields in purchase orders. The AI can auto-fill fields based on a combination of any initial inputs provided by the purchasing clerk and other relevant information sourced from master data tables, inputs from production lines, and so on. The purchasing clerk can then review the auto-filled order and has the option of either accepting the AI-generated proposals (i.e., predictions) for each field or overriding incorrect proposals with manual entries. In cases where the AI is unsure of the correct value to fill (as exemplified by a low model confidence score for the given prediction), the field is left blank, and the clerk must manually fill it with a suitable value. An AI feature for flexibly auto-filling forms in this manner can be built using an approach called denoising, as described in this article.
To ensure high quality, the product team would like to set a threshold for model confidence scores, such that only predictions with confidence scores above this predefined threshold are shown to the user (i.e., used to auto-fill the purchase order form). The question is: what threshold value should be chosen?
Let c1 and c2 be the payoffs of showing correct and incorrect predictions to the user (due to being above the confidence threshold), respectively. Let c3 and c4 be the payoffs of not showing correct and incorrect predictions to the user (due to being below the confidence threshold), respectively. Presumably, there should be a positive payoff (i.e., a benefit) to showing correct predictions (c1) and not showing incorrect ones (c4). By contrast, c2 and c3 should be negative payoffs (i.e., costs). Picking a threshold that is too low increases the chance of showing wrong predictions that the clerk must manually correct (c2). But picking a threshold that is too high increases the chance of correct predictions not being shown, leaving blank fields on the purchase order form that the clerk would need to spend some effort to manually fill in (c3). The product team thus has a trade-off on its hands — can expected value analysis help resolve it?
As it happens, the team is able to estimate reasonable values for the payoff factors c1, c2, c3, and c4 by leveraging findings from user research and business domain know-how. Furthermore, the data scientists on the product team are able to estimate the probabilities associated with these payoffs by training an example AI model on a dataset of historical purchase orders at ACME Auto and analyzing the results. Suppose k is the confidence score attached to a prediction. Then given a predefined model confidence threshold t, let q(k > t) denote the proportion of predictions that have confidence scores greater than t; these are the predictions that would be used to auto-fill the purchase order form. The proportion of predictions with confidence scores at or below the threshold is q(k ≤ t) = 1 – q(k > t). Furthermore, let p(k > t) and p(k ≤ t) denote the average accuracies of predictions that have confidence scores greater than t and at most t, respectively. The expected value (or expected payoff) S per prediction can be derived by summing up the expected values attributable to each of the four payoff drivers (denoted s1, s2, s3, and s4), as shown in Figure 2 below. The task for the product team is then to test various threshold values t and identify one that maximizes the expected payoff S.
[Figure 2: Tree diagram deriving the expected payoff S per prediction by summing the expected values s1, s2, s3, and s4 attributable to the four payoff drivers]
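The sketch below shows how such a threshold sweep might look in Python. It is my own illustration: the payoff values and the validation predictions are made-up placeholders, and the decomposition of S into s1 through s4 follows the definitions of c1 to c4, q, and p given above.

```python
# Hypothetical sketch of the threshold analysis for auto-filled purchase order fields.
# Payoffs and the "validation set" below are made-up placeholders.
import numpy as np

rng = np.random.default_rng(42)

# Placeholder validation data: model confidence scores and whether each prediction was correct
confidences = rng.beta(a=5, b=2, size=10_000)        # scores skewed toward higher confidence
correct = rng.random(10_000) < confidences           # higher confidence -> more often correct

# Payoffs per prediction (hypothetical, in dollars of clerk effort saved or wasted)
c1, c2, c3, c4 = 1.0, -3.0, -0.5, 0.2  # shown-correct, shown-incorrect, hidden-correct, hidden-incorrect

def expected_payoff(t: float) -> float:
    shown = confidences > t
    q_above = shown.mean()                                        # q(k > t)
    q_below = 1.0 - q_above                                       # q(k <= t)
    p_above = correct[shown].mean() if shown.any() else 0.0       # p(k > t)
    p_below = correct[~shown].mean() if (~shown).any() else 0.0   # p(k <= t)
    s1 = c1 * q_above * p_above          # prediction shown and correct
    s2 = c2 * q_above * (1 - p_above)    # prediction shown but incorrect
    s3 = c3 * q_below * p_below          # correct prediction withheld (field left blank)
    s4 = c4 * q_below * (1 - p_below)    # incorrect prediction withheld
    return s1 + s2 + s3 + s4

thresholds = np.linspace(0.0, 0.99, 100)
best_t = max(thresholds, key=expected_payoff)
print(f"Best threshold: {best_t:.2f}, expected payoff per prediction: {expected_payoff(best_t):.3f}")
```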
Case Study 3: Standardizing AI Design Guidance
The CEO of Ex Corp, a global enterprise software vendor, has recently declared her intention to make the company “AI-first” and infuse all of its products and services with high-value AI features. To support this company-wide transformation effort, the Chief Product Officer has tasked the central design team at Ex Corp with creating a consistent set of design guidelines to help teams build AI products that enhance user experience. A key challenge is managing the trade-off between creating guidance that is too weak/high-level (giving individual product teams greater freedom of interpretation while risking inconsistent application of the guidance across product teams) and guidance that is too strict (enforcing standardization across product teams without due regard for product-specific exceptions or customization needs).
One well-intentioned piece of guidance that the central design team initially came up with involves displaying labels next to predictions on the UI (e.g., “best option,” “good alternative,” or similar), to give users some indication of the expected quality/relevance of the predictions. It is thought that showing such qualitative labels would help users make informed decisions during their interactions with AI products, without overwhelming them with hard-to-interpret statistics such as model confidence scores. In particular, the central design team believes that by stipulating a consistent, global set of model confidence thresholds, a standardized mapping can be created for translating between model confidence scores and qualitative labels for products across Ex Corp. For example, predictions with confidence scores greater than 0.8 can be labeled as “best,” predictions with confidence scores between 0.6 and 0.8 can be labeled as “good,” and so on.
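In code, the kind of global mapping the central design team has in mind might look like the following (a minimal sketch of my own, using the illustrative thresholds and labels from the paragraph above):

```python
# Minimal sketch of a standardized, global mapping from model confidence scores to labels.
from typing import Optional

def quality_label(confidence: float) -> Optional[str]:
    if confidence > 0.8:
        return "best option"
    if confidence > 0.6:
        return "good alternative"
    return None  # below the lowest global threshold: no label is displayed

print(quality_label(0.85))  # best option
print(quality_label(0.65))  # good alternative
print(quality_label(0.30))  # None
```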
As we have seen in the previous case study, it is possible to use expected value analysis to derive a model confidence threshold for a specific use case, so it is tempting to try to generalize this threshold across all use cases in the product portfolio. However, this is trickier than it first seems, and the probability theory underlying expected value analysis can help us understand why. Consider two simple games, a coin flip and a die roll. The coin flip entails two possible outcomes, landing heads or tails, each with a 1/2 probability of occurring (assuming a fair coin). Meanwhile, as we discussed previously, rolling a fair, six-sided die entails six possible outcomes for the top-facing side (1, 2, 3, 4, 5, or 6 pips), each with a 1/6 probability of occurring. A key insight here is that, as the number of possible outcomes of a random variable (also called the cardinality of the outcome set) increases, it generally becomes harder and harder to correctly guess the outcome of an arbitrary event. If you guess that the next coin flip will result in heads, you will be right half the time on average. But if you guess that you will roll any particular number (say, 3) on the next die roll, you will only be correct one out of six times on average.
Now, what if we were to set a global confidence threshold of, say, 0.4 for both the coin and dice games? If an AI model for the dice game predicts a 3 on the next roll with a confidence score of 0.45, then we might happily label this prediction as “good” or even “great”; after all, the confidence score is above the predefined global threshold and significantly higher than 1/6 (the success probability of a random guess). However, if an AI model for the coin game predicts heads on the next coin flip with the same confidence score of 0.45, we may suspect that this is a false positive and not show the prediction to the user at all; although the confidence score is above the predefined threshold, it is still below 0.5 (the success probability of a random guess).
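The point can be made concrete with a small sketch (my own illustration; the idea of comparing a confidence score to the success probability of a random guess follows directly from the argument above):

```python
# Why the same confidence score means different things for different outcome cardinalities.
GLOBAL_THRESHOLD = 0.4
CONFIDENCE = 0.45  # model confidence for the predicted outcome in both games

for game, num_outcomes in [("coin flip", 2), ("die roll", 6)]:
    random_guess = 1.0 / num_outcomes          # success probability of guessing at random
    above_threshold = CONFIDENCE > GLOBAL_THRESHOLD
    lift = CONFIDENCE / random_guess           # how much better than a random guess
    print(f"{game}: above global threshold = {above_threshold}, "
          f"random-guess baseline = {random_guess:.2f}, lift = {lift:.2f}x")

# coin flip: above the global threshold, yet worse than a random guess (lift 0.90x)
# die roll:  above the global threshold and far better than a random guess (lift 2.70x)
```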
The above analysis suggests that a single, one-size-fits-all mapping from model confidence scores to qualitative labels should be struck from the standardized design guidance for AI use cases. Instead, individual product teams should perhaps be empowered to make use-case-specific decisions about whether and how to display qualitative labels (if at all).
The Wrap
Decision making under uncertainty is a key concern for AI product teams, and will likely gain in importance in a future dominated by AI. In this context, expected value analysis can help guide AI product management. The expected value of an uncertain outcome represents the theoretical, long-term, average value of that outcome. Using real-life case studies, this article shows how expected value analysis can help teams make educated, strategic decisions under uncertainty across the product lifecycle.
As with any such mathematical modeling approach, however, it is worth emphasizing two important points. First, an expected value calculation is only as good as its structural completeness and the accuracy of its inputs. If not all relevant value drivers are included, the calculation will be structurally incomplete, and the resulting findings will be inaccurate. Using conceptual frameworks such as the matrices and tree diagrams shown in Figures 1 and 2 above can help teams verify the completeness of their calculations. Readers can refer to this book to learn how to leverage conceptual frameworks. If the data and/or assumptions used to derive the outcome values and their probabilities are faulty, then the resulting expected value will be inaccurate, and potentially damaging if used to inform strategic decision making (e.g., wrongly sunsetting a promising product). Second, it is usually a good idea to pair a quantitative approach like expected value analysis with qualitative approaches (e.g., customer interviews, observing how users interact with the products) to get a well-rounded picture. Qualitative insights can help us sanity-check the inputs to the expected value calculation, better interpret the quantitative results, and ultimately derive holistic recommendations for decision making.


