Developing an Elo Based, Data-Driven Rating System for 2v2 Multiplayer Games

[ad_1]

From friendly matches to intense competition, foosball has found its niche in corporate culture, providing a unique way for teams to connect and compete.

This article explores the math behind a 2v2 Elo-based scoring system that can be applied to foosball or any other 2v2 game. It also examines the architecture that supports data processing, and presents the creation of a web application that provides real-time ranking and data analysis using Python.

The Elo rating system is a method used to determine the relative skill level of a player in a zero-sum games. It was first developed for chess but is now being applied as a rating system in a variety of other sports such as baseball, basketball, various board games and e-sports.

One well-known example of this system is in chess, where the Elo rating system is employed to rank players worldwide. Magnus Carlsen, also known as the “Mozart of Chess”, holds the highest Elo rating in the world with a rating of 2,853 in 2023, demonstrating his extraordinary skills in the game.

The Elo rating formula is a two-part formula: first, it calculates the expected outcome for a given group of players, and then it determines the rating adjustment based on the outcome of the match and the expected outcome.

Expected Outcome Calculation

Consider the following example in chess with Player A and Player B with ratings R𝖠 and R𝖡 respectively. The equation for the expected score of Player A against Player B is the following:

The Elo algorithm uses a variable that can be adjusted to control how the winning probability is influenced by the players’ ratings. In this example, it is set to 400, which is typical for most sports, including chess.

Now let’s take a look at a more realistic example, where player A has a rating of 1,500 and Player B, 1,200.

The same equation seen above can calculate Player A’s expected score against Player B:

With this calculation, we know that Player A has a 84.9% chance of winning against Player B.

To find the estimated probability of Player B winning against Player A, the same formula is used, but the order of ratings is reversed:

The sum of the probabilities of Player A winning and Player B winning equals 1 (0.849 + 0.151 = 1). In this scenario, Player A therefore has an 84.9% chance of winning, leaving Player B with only a 15.1% chance.

Rating Calculation

The difference in rating between the winner and the loser determines the total number of points won or lost after each game.

If a player with a much higher Elo rating wins, they will receive fewer points for their victory, and their opponent will lose only a few points for their defeat.
By contrast, if the lower-ranked player wins, this achievement is considered much more significant, thus the reward is greater and the higher-ranked opponent is penalized accordingly.

The formula to calculate the new rating of Player A playing against Player B is the following:

In this formula, ( S𝖠 — E𝖠 ) represents the difference between Player A’s actual score and the expected score. The additional variable K determines approximately how much a player’s rating can change after a single match. In chess, this variable is set to 32.

If Player A wins, the actual score, which is 1 in this case, will be greater than the expected score of 0.849, creating a positive variance.

This indicates that Player A performed better than initially expected. As a result, the Elo rating system recalibrates the ratings for both players:

Player A’s rating will increase because of the win
Player B’s rating will decrease because of the loss

Once again, this same equation can calculate the new rating of Player A and Player B:

In summary, the Elo rating system offers a robust and efficient method for evaluating and comparing players’ skills dynamically and fairly. It continually updates a player’s rating after each match, considering the skill difference between the two opponents.

This approach rewards risk-taking, as winning against a higher-rated player results in a more significant increase in a player’s rating, as shown in the table below:

FIGURE I : Example of the Elo System in Chess | Table by the author

On the other hand, if a higher-rated player goes against their winning probability and loses against a lower-rated player, their rating will be significantly impacted: they will lose more points, and their opponent will gain more points.

In summary, when a player wins a match, the lower their winning probability is, the higher the amount of points they can win.

In its current state, this rating formula, originally designed for chess, is not fully adapted to foosball.

In fact, foosball does have more variables than chess such as:

It is a four-player game with teams of two (2v2)
Each team member can positively or negatively influence their teammate
Unlike the binary outcome in chess, the scale of victory or defeat in foosball can vary considerably depending on the teams’ scores

The focus here is on adapting the Elo rating system to the unique requirements of foosball games, involving four players divided into two teams.

Winning Probability

To begin calculating new player ratings, a refined formula needs to be established to determine the expected outcome of a game involving four players in two teams.

To demonstrate this, consider a hypothetical four-player foosball game scenario: Player 1, Player 2, Player 3, and Player 4, each with a different rating that represents their skill level.

FIGURE II: Scenario with 4 Players Playing Foosball | Table by the author

To calculate the expected score of Team 1 against Team 2 in the revised Elo rating system, the expected score of each player involved in the game needs to be determined.

Player 1’s expected rating, denoted by E𝖯𝟣, can be calculated by averaging the sum of each opponent’s rating using the Elo rating formula as follows:

After extensive testing, it was decided that it would be appropriate for the expected score formula to set the variable used to divide the rating difference to 500, rather than the traditional value of 400 used in chess. This increased value means that a player’s rating will have a smaller impact on their expected score.

A primary reason for this adjustment is that, unlike chess, there is a slight element of chance in foosball. By using a value of 500, the game outcomes can be more accurately predicted, and a reliable rating system can be developed.

To calculate the expected score of Player 2 denoted by E𝖯𝟤, against Player 3 and Player 4, the same method as utilized for Player 1 can be employed.

The expected score of the Team denoted E𝖳𝟣 can then be calculated by taking the average of E𝖯𝟣 and E𝖯𝟤 :

Once the expected scores for each player are computed, they can then be used to calculate the outcome of the match. The team with the highest expected score is more likely to win. By averaging the expected scores for each team member, the issue of skill differences within the team can then be solved !

The table below shows the expected scores of Player 1 and 2 against Players 3 and 4.

P1’s expected scores against P3 and P4 are 0.091 and 0.201, corresponding to a 14.6% chance of winning
P2’s expected scores against P3 and P4 are 0.201 and 0.387, giving a combined winning probability of 29.4%
For P1, partnering with a stronger player like P2 can increase their overall chances of winning, as demonstrated by the 22%

FIGURE III : Expected Score Based on the Scenario Shown in Figure II | Table by the author

If the team of P1 and P2 wins, P1 gains fewer points than their individual expected score would suggest, as P2, who is higher ranked, also contributes to the win and lowers their overall winning probability.

On the other hand, P2 gains more points due to having a lower- ranked teammate. In case of a win, P2 is rewarded for taking a risk, while P1 earns fewer points, as it is assumed P2 contributed more significantly to the victory, and vice versa if they lose.

Rating Parameters

Now that the expected outcome of a four-player match has been determined, this information can be incorporated into a new formula that considers multiple variables that affect the match and player ratings.

As discussed earlier, the K-value can be modified to better fit the needs of the rating system. This new formula considers the number of games played by each player, reflecting their seniority as well as the result of the game.

For example, in the 2014 World Cup semi-final, Germany defeated Brazil by a score of 7–1. This was one of the most shocking and humiliating results in World Cup history, as Brazil was the host nation and had never lost a competitive match at home since 1975.

If we were to apply the rating system to this match, we would expect Germany to gain a significant amount of points, while Brazil would lose a large amount of points, reflecting the difference in their performance and skill level.

K-Value
The K-rating, denoted as K𝟣 for Player 1 in this case, determines how much a player’s rating will change after one game. This revised K-value takes into account the number of games the player has played to balance the effect of each game on their rating. After conducting numerous tests, a formula was developed for calculating the K-value for each player.

For Player 1, this is expressed as:

This formula for the K-value is designed to have a greater impact on the rating for new players while providing stability and less rating fluctuation for experienced players. Specifically, after playing 300 games, a player’s rating becomes more representative of their skill level.

FIGURE IV: K-value Over Time | Chart by the author

Figure IV shows the effect of the number of games played on the K-value. Starting at 50, this graph shows that the K-value decreases as the number of games played increases, reaching a halved value of 25 after 300 games. This ensures that the impact of each game on a player’s rating decreases as experience increases.

Point Factor
To consider the points scored by each team, a new variable, called the “point factor”, was introduced into the equation. This factor multiplies the K parameter of each player and is based on the absolute difference in points between the two teams. The impact of a match must be greater when a team wins by a large margin, i.e., an overwhelming victory.

To calculate the point factor, the following formula was used:

This formula takes the absolute difference between the scores of the two teams, adds 1, and computes the base-10 logarithm of the result. This value is then cubed and 2 is added to the result to obtain the final value of the point factor.

FIGURE V: Point Factor | Chart by the author

Final Rating Calculation

After adjusting all the necessary variables, an improved formula was developed to calculate the new ranking of each player involved in a game.

Each player’s rating now takes into account their previous rating, the rating of their opponents, the impact of their teammates, their playing history, and the score of the game. This formula ensures that each player is rewarded according to their true performance, taking into account the fairness of each match.

Going off of the previous example, the new formula for player A’s ranking is the following:

This improved formula rewards players based on their actual performance, encourages risk taking and provides a more balanced rating system for both new and experienced players.

Now that we have an Elo algorithm, we can move on to database modeling.

The proposed database model adopts a relational approach, organizing data into interconnected tables through the use of Primary Keys (PKs) and Foreign Keys (FKs). This structured organization facilitates data management and analysis, making PostgreSQL an appropriate choice as the database management system. PKs and FKs help maintain data consistency and minimize redundancy within the database.

FIGURE VI: Diagram Model of the Database | Image by the author

Two types of relationships exist between tables in this database model: one-to-many and many-to-many.

The relationship between the ‘Player’ table and the ‘Match’ table is many-to-many since a player can participate in numerous matches, and multiple players can be involved in a single match. A junction table called ‘PlayerMatch’ bridges this relationship, containing two foreign keys: ‘player_id’ (referencing the participating player) and ‘match_id’ (referencing the corresponding match).

This structure ensures the accurate association of players and matches as demonstrated in the code below:

CREATE TABLE PlayerMatch (
player_match_id serial PRIMARY KEY,
player_id INT NOT NULL REFERENCES Player(player_id),
match_id INT NOT NULL REFERENCES Match(match_id)
);

A similar logic applies to the ‘TeamMatch’ table, which serves as a junction between the ‘Match’ and ‘Team’ tables, allowing multiple teams to play one match and one match to involve multiple teams.

Separate tables for ‘PlayerRating’ and ‘TeamRating’ have been designed to streamline ranking analysis over time. These tables connect to the ‘PlayerMatch’ and ’TeamMatch’ tables respectively through ‘player_match_id’ and ‘team_match_id’.

Data Integrity

In addition to the use of PKs and FKs, this database model also uses appropriate data types and CHECK constraints for data integrity:

The ‘winning_team_score’ and ‘losing_team_score’ columns in the ‘Match’ table are integers, preventing non-numeric entries
CHECK constraints enforce that the ‘winning_team_score’ is exactly 11
CHECK constraints enforce that the ‘losing_team_score’ is between 0 and 10, adhering to the game rules

As seen in the code chunk below, the use of sequences for each primary key has been implemented in the database creation to facilitate data entry. This automation simplifies the overall procedure when later using the Python loop for the data entry process.

CREATE SEQUENCE player_id_seq START 1;
CREATE SEQUENCE team_id_seq START 1;
CREATE SEQUENCE match_id_seq START 1;
CREATE SEQUENCE player_match_id_seq START 1;
CREATE SEQUENCE player_rating_id_seq START 1;
CREATE SEQUENCE team_match_id_seq START 1;
CREATE SEQUENCE team_rating_id_seq START 1;

Data Processing

The main challenge was to find a way to process the match data in a sequence that would allow for the retrieval of the IDs from the initial data that was being processed and inserted into the database.

These particular IDs could then serve as foreign keys to manage the remaining data, creating the necessary relationships in the process. In other words, the first step was to identify and store specific data (IDs) from the raw data, and then use these IDs as a bridge to link and process the rest of the data.

The data was processed step by step, using increasingly complex Python loops. Each new entry was assigned a unique primary key generated from the table’s sequence.

The first step was to handle the individual players and obtain their IDs.
Next, teams were processed using the player IDs. For each unique pair of players in a match, an entry was created in the ‘Team’ table (FK players)
Following this, matches were handled using the winning and losing team IDs. After processing the matches, the ‘PlayerMatch’ and ‘TeamMatch’ tables were addressed by retrieving the corresponding match, player, and team IDs
Once all the necessary data had been processed, the ‘PlayerMatch’ and ‘TeamMatch’ IDs, along with the ‘match’ timestamps, were used in the ‘PlayerRating’ and ’TeamRating’ tables to track the evolution of ratings over time.

The objective of the web application is to allow users to enter game results, verify data, and interact directly with the database. This ensures that the data is up-to-date and offered in real time so that users are always able to access ranking or visualize their metrics.

Additionally, I wanted to make the web app mobile-friendly, because who would want to drag a laptop around to play foosball? That would not be very practical or fun.

Technology Stack

Backend
After comparing Django and Flask, two popular web frameworks for building web applications in Python, Flask was chosen for its beginner-friendly approach. The Flask web framework is used to handle user requests, process data, and interact with the PostgreSQL database.

Frontend
The frontend consists of static HTML and CSS files, which define the structure and styling of the web application. JavaScript is used for form validation and handling user interactions. This ensures that the data submitted by users is consistent and accurate before being sent to the backend.

Data Visualization
When it comes to data visualization, the biggest challenge is having up-to-date data. To overcome this limitation, the data visualization layer uses Plotly, a Python library, to generate interactive charts and graphs that visualize player ratings over time. This component receives data from the backend, processes it, and presents it to users in a user-friendly format.

Database
PostgreSQL was used for both the local development environment as well as the production environment on AWS, via Heroku. Automatic database backups are facilitated by Heroku, ensuring that data is protected and can be easily restored if necessary.

UI/UX Research

For the UI/UX design, inspiration was drawn from the modern web designs of Spotify and the new Bing search engine. The goal was to create a familiar and intuitive user experience.

FIGURE VII: Mockup of the Application | Image by the author

Let’s dive into the features of the application with a concrete scenario. Team 1 (Matthieu and Gabriel) wants to play against Team 2 (Wissam and Malik). All players have a different rating that is representative of their skill level, shown below.

Calculate Odds

The first thing players want to do before any match is to calculate their winning probability.

To do so, the “Calculate Odds” view allows users to select four players using the drop-down menu and generate the winning probability for the selected teams.

FIGURE VIII: Calculate Odds | Image by the author

This feature is primarily used before a game to verify that a match is balanced and to inform players about their winning probability. For example, Team 1 has a higher chance of winning (64.19%) than Team 2 who has a 35.81% chance of winning. This view informs each player of the stakes and the risk taken.

Once the form is submitted, the application computes only the first part of the algorithm, which consists of calculating the expected outcome of a game given the four selected players.

Upload a game

The “Upload a Game” view serves the home page of the application. It is designed for user convenience, allowing them to upload a game immediately upon opening the app.

Before the form is submitted, the application performs data validation using JavaScript to ensure:

Four different players are selected
Scores are non-negative integers
There is only one winning team with a score of exactly 11, with no draws allowed

When the validation is successful, the application processes the data using the full algorithm, updates the corresponding tables in the database, and gives users a confirmation of their upload.

The “Match Uploaded” view is designed to show users the effect of each match on their individual ratings. It calculates the difference between the players’ ratings before and after the match was uploaded.

As shown above, the game does not have the same effect on each player’s rating. This is because of the individual parameters of the algorithm on each player: their expected score, their number of games, their teammate and the opposing team.

Elo Ranking

The “Player Ranking” view allows users to access the real-time monthly ranking and compare themselves with other players. Users can see their rating, the number of games they played throughout the month, and the last game they played showcasing their latest rating.

FIGURE X: Player Ranking | Image by the author

Once the “Player Ranking” view is accessed or a new period is submitted, the application queries the database using a CTE approach.

This involves joining all necessary tables and displaying the most recent ranking update, using the period selector to filter the query:

def get_latest_player_ratings(month=None, year=None):
now = datetime.now()
default_month = now.month
default_year = now.year
selected_year = int(year) if year else default_year
selected_month = int(month) if month else default_month
start_date = f'{selected_year}-{selected_month:02d}-01 00:00:00'
end_date = f'{selected_year}-{selected_month:02d}-{get_last_day_of_month(selected_month, selected_year):02d} 23:59:59'query = '''
WITH max_player_rating_timestamp AS (
SELECT 
pm.player_id,
MAX(pr.player_rating_timestamp) as max_timestamp
FROM PlayerMatch pm
JOIN PlayerRating pr ON pm.player_match_id = pr.player_match_id
WHERE pr.player_rating_timestamp BETWEEN %s AND %s
GROUP BY pm.player_id
),
filtered_player_match AS (
SELECT 
pm.player_id,
pm.match_id
FROM PlayerMatch pm
JOIN max_player_rating_timestamp mprt ON pm.player_id = mprt.player_id
),
filtered_matches AS (
SELECT match_id
FROM Match
WHERE match_timestamp BETWEEN %s AND %s
)
SELECT 
CONCAT(p.first_name, '.', SUBSTRING(p.last_name FROM 1 FOR 1)) as player_name, 
pr.rating, 
COUNT(DISTINCT fpm.match_id) as num_matches,
pr.player_rating_timestamp
FROM Player p
JOIN max_player_rating_timestamp mprt ON p.player_id = mprt.player_id
JOIN PlayerMatch pm ON p.player_id = pm.player_id
JOIN PlayerRating pr ON pm.player_match_id = pr.player_match_id
AND pr.player_rating_timestamp = mprt.max_timestamp
JOIN filtered_player_match fpm ON p.player_id = fpm.player_id
JOIN filtered_matches fm ON fpm.match_id = fm.match_id
GROUP BY p.player_id, pr.rating, pr.player_rating_timestamp
ORDER BY pr.rating DESC;
'''

The primary goal in developing this comprehensive solution was to provide users with a real-time ranking system that serves as a visual representation of each player’s performance.

Although powerful tools like PowerBI and Qlik are available for data visualization, a fully mobile-compatible solution was chosen, allowing users to gain real-time insights on their devices without incurring licensing fees.

Two methods were utilized to achieve this:

First, Dash Plotly, a Python framework that enables developers to build interactive, data- driven applications on top of Flask applications, was used
Second, various SQL queries and static HTML pages were employed to pull information from the database and display it, ensuring that users always have access to real-time data

Rating Evolution

This visualization allows players to observe the impact of each game on their ranking and to identify broader trends. For example, they can see exactly when someone overtakes them or see the impact of consecutive wins or losses.

When accessing the “Rating Evolution” view, the application performs a query on the database for each selected player, retrieving the most recent ranking update for each day a game was played:

SELECT DISTINCT ON (DATE_TRUNC('day', m.match_timestamp))
DATE_TRUNC('day', m.match_timestamp) AS day_start,
CASE WHEN p.first_name = '{player}' THEN pr.rating ELSE NULL END AS rating
FROM PlayerMatch pm
JOIN Player p ON pm.player_id = p.player_id
JOIN PlayerRating pr ON pm.player_match_id = pr.player_match_id
JOIN Match m ON pm.match_id = m.match_id
WHERE p.first_name = '{player}'
ORDER BY DATE_TRUNC('day', m.match_timestamp) DESC, m.match_timestamp DESC

The retrieved data table is then transformed into a line chart, with the columns converted into axes using Dash.

To reduce the database load and simplify the data presentation in the chart, only the latest rating update is displayed for each day.

Player Metrics

Inspired by Spotify Wrapped, the idea is to provide insights derived from constant data collection. While there is immense potential to visualize player insights, the focus is on metrics that highlight individual performance and connections between players.

These metrics are organized into three color-coded categories: partner, games, and rivals, with each metric accompanied by a title, a value, and a sub-measure for more detail.

Game Metrics
These metrics are centered on the screen and displayed in blue for neutrality. They include the total number of games played since data collection began.

Partner Metrics
The partner metrics appear on the left side of the screen. They are displayed in green because of their positive connotation.

The top box highlights the primary partner with whom the selected player has played the most games
The second metric identifies the player’s best partner. This is defined by the highest winning percentage
The third metric in this category is the selected player’s worst partner This is calculated based on the lowest win percentage (or highest loss percentage)

Rival Metrics
Rival metrics are displayed in red to indicate opposition. Rival metrics represent the competitive relationship between players.

The top box shows the most common opponent, with a sub-metric indicating the number of games played together, similar to the partner metrics
The second metric, “Easiest Rival”, represents the opponent against whom the player has the highest win rate. This indicates a weaker opponent
The final metric is the player against whom the selected player has the lowest win rate. This metric indicates the most difficult opponent

As I write this, it’s been 6 months that the application has been in use, and these are the results so far:

This ranking system based on the Elo system predicts match results and accurately ranks players based on their actual performance
Players have become more competitive, as they are now increasingly aware of their performance due to data visualization
Players have become more inclusive thanks to an improved formula that rewards players who take risks. Players who wouldn’t normally play together now have the incentive to pair up

By adopting a data-driven strategy, this project has highlighted the profound influence and importance of data.

Going beyond simple analysis of player performance, this project has initiated a transformation in the way players approach foosball games and interact with other players as well as newcomers. The power of data has truly cultivated a more inclusive and competitive environment.

[ad_2]

Source link

Developing an Elo Based, Data-Driven Rating System for 2v2 Multiplayer Games | by Lazare Kolebka