As a Python developer, you should take testing your code seriously. You might write unit tests with pytest, mock dependencies, and strive for high code coverage. If you’re like me, though, you might have a nagging question lingering at the back of your mind after you finish coding a test suite.
“Have I thought of all the edge cases?”
You might test your inputs with positive numbers, negative numbers, zero, and empty strings. But what about weird Unicode characters? Or floating-point numbers that are NaN or infinity? What about a list of lists of empty strings or complex nested JSON? The space of possible inputs is huge, and it’s hard to think of the myriad different ways your code could break, especially if you’re under some time pressure.
Property-based testing flips that burden from you to the tooling. Instead of hand-picking examples, you state a property — a truth that must hold for all inputs. The Hypothesis library then generates inputs (several hundred if required), hunts for counterexamples and, if it finds one, shrinks the failure down to the simplest failing case.
In this article, I’ll introduce you to the powerful concept of property-based testing and its implementation in Hypothesis. We’ll go beyond simple functions and show you how to test complex data structures and stateful classes, as well as how to fine-tune Hypothesis for robust and efficient testing.
So, what exactly is property-based testing?
Property-based testing is a methodology where, instead of writing tests for specific, hardcoded examples, you define the general “properties” or “invariants” of your code. A property is a high-level statement about the behaviour of your code that should hold for all valid inputs. You then use a testing framework, like Hypothesis, which intelligently generates a wide range of inputs and tries to find a “counter-example” — a specific input for which your stated property is false.
Some key aspects of property-based testing with Hypothesis include:
- Generative Testing. Hypothesis generates test cases for you, from the simple to the unusual, exploring edge cases you would likely miss.
- Property-Driven. It shifts your mindset from “what is the output for this specific input?” to “what are the universal truths about my function’s behaviour?”
- Shrinking. This is Hypothesis’s killer feature. When it finds a failing test case (which might be large and complex), it doesn’t just report it. It automatically “shrinks” the input down to the smallest and simplest possible example that still causes the failure, often making debugging dramatically easier (a quick illustration follows this list).
- Stateful Testing. Hypothesis can test not just pure functions, but also the interactions and state changes of complex objects over a sequence of method calls.
- Extensible Strategies. Hypothesis provides a robust library of “strategies” for generating data, and allows you to compose them or build entirely new ones to match your application’s data models.
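To make shrinking concrete, here’s a tiny, self-contained illustration (a throwaway test, not one we build on later):

from hypothesis import given, strategies as st

@given(st.integers())
def test_numbers_are_small(n):
    # This property is deliberately false for any n >= 1000.
    assert n < 1000

Whatever large random integer first trips the assertion, Hypothesis shrinks it and typically reports the minimal counterexample, n=1000.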
Why Hypothesis Matters / Common Use Cases
The primary benefit of property-based testing is its ability to find subtle bugs and increase your confidence in the correctness of your code far beyond what’s possible with example-based testing alone. It forces you to think more deeply about your code’s contracts and assumptions.
Hypothesis is particularly effective for testing:
- Serialisation/Deserialisation. A classic property is that for any object x, decode(encode(x)) should be equal to x. This is perfect for testing functions that work with JSON or custom binary formats.
- Complex Business Logic. Any function with complex conditional logic is a great candidate. Hypothesis will explore paths through your code that you may not have considered.
- Stateful Systems. Testing classes and objects to ensure that no sequence of valid operations can put the object into a corrupted or invalid state.
- Testing against a reference implementation. You can state the property that your new, optimised function must always produce the same result as a simpler, known-good reference implementation.
- Functions that accept complex data models. Testing functions that take Pydantic models, dataclasses, or other custom objects as input.
Setting up a development environment
All you need is Python and pip. We’ll install pytest as our test runner, hypothesis itself, and pydantic for one of our advanced examples.
(base) tom@tpr-desktop:~$ python -m venv hyp-env
(base) tom@tpr-desktop:~$ source hyp-env/bin/activate
(hyp-env) (base) tom@tpr-desktop:~$
# Install pytest, hypothesis, and pydantic
(hyp-env) (base) tom@tpr-desktop:~$ pip install pytest hypothesis pydantic
# create a new folder to hold your python code
(hyp-env) (base) tom@tpr-desktop:~$ mkdir hypothesis_project
Hypothesis is best run by using an established test runner tool like pytest, so that’s what we’ll do here.
Code example 1 — A simple test
In this simplest of examples, we have a function that calculates the area of a rectangle. It should take two integer parameters, both greater than zero, and return their product.
Hypothesis tests are defined using two things: the @given decorator and a strategy, which is passed to the decorator. Think of a strategy as a recipe for the kind of data Hypothesis will generate to test your function. Here’s a simple example. First, we define the function we want to test.
# my_geometry.py
def calculate_rectangle_area(length: int, width: int) -> int:
"""
Calculates the area of a rectangle given its length and width.
This function raises a ValueError if either dimension is not a positive integer.
"""
if not isinstance(length, int) or not isinstance(width, int):
raise TypeError("Length and width must be integers.")
if length <= 0 or width <= 0:
raise ValueError("Length and width must be positive.")
return length * width
Next is the testing function.
# test_my_geometry.py
from my_geometry import calculate_rectangle_area
from hypothesis import given, strategies as st
import pytest
# By using st.integers(min_value=1) for both arguments, we guarantee
# that Hypothesis will only generate valid inputs for our function.
@given(
length=st.integers(min_value=1),
width=st.integers(min_value=1)
)
def test_rectangle_area_with_valid_inputs(length, width):
"""
Property: For any positive integers length and width, the area
should be equal to their product.
This test ensures the core multiplication logic is correct.
"""
print(f"Testing with valid inputs: length={length}, width={width}")
# The property we are checking is the mathematical definition of area.
assert calculate_rectangle_area(length, width) == length * width
Adding the @given decorator to the function turns it into a Hypothesis test. Passing the st.integers strategy for each argument tells Hypothesis to generate random integers for length and width, and min_value=1 further constrains them so that neither can be less than one.
We can run this test with pytest like so.
(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_my_geometry.py
=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item
test_my_geometry.py Testing with valid inputs: length=1, width=1
Testing with valid inputs: length=6541, width=1
Testing with valid inputs: length=6541, width=28545
Testing with valid inputs: length=1295885530, width=1
Testing with valid inputs: length=1295885530, width=25191
Testing with valid inputs: length=14538, width=1
Testing with valid inputs: length=14538, width=15503
Testing with valid inputs: length=7997, width=1
...
...
Testing with valid inputs: length=19378, width=22512
Testing with valid inputs: length=22512, width=22512
Testing with valid inputs: length=3392, width=44
Testing with valid inputs: length=44, width=44
.
============================================ 1 passed in 0.10s =============================================
By default, Hypothesis will perform 100 tests on your function with different inputs. You can increase or decrease this by using the settings decorator. For example,
from hypothesis import given, settings, strategies as st
...
...
@given(
length=st.integers(min_value=1),
width=st.integers(min_value=1)
)
@settings(max_examples=3)
def test_rectangle_area_with_valid_inputs(length, width):
...
...
#
# Outputs
#
(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_my_geometry.py
=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item
test_my_geometry.py
Testing with valid inputs: length=1, width=1
Testing with valid inputs: length=1870, width=5773964720159522347
Testing with valid inputs: length=61, width=25429
.
============================================ 1 passed in 0.06s =============================================
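One more property worth stating for this function is its error-handling contract: any non-positive dimension must be rejected. Here’s a minimal sketch that could be added to test_my_geometry.py (the test name and strategy bounds are my own additions):

from my_geometry import calculate_rectangle_area
from hypothesis import given, strategies as st
import pytest

@given(
    length=st.integers(max_value=0),  # zero or negative lengths...
    width=st.integers(min_value=1)    # ...paired with valid widths
)
def test_rectangle_area_rejects_non_positive_length(length, width):
    # Property: a non-positive length must always raise a ValueError.
    with pytest.raises(ValueError):
        calculate_rectangle_area(length, width)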
Code Example 2 — Testing the Classic “Round-Trip” Property
Let’s look at a classic property: serialisation and deserialisation should be reversible. In short, decode(encode(x)) should return x.
We’ll write a function that takes a dictionary and encodes it into a URL query string.
Create a file in your hypothesis_project folder named my_encoders.py.
# my_encoders.py
import urllib.parse
def encode_dict_to_querystring(data: dict) -> str:
# A bug exists here: it doesn't handle nested structures well
return urllib.parse.urlencode(data)
def decode_querystring_to_dict(qs: str) -> dict:
return dict(urllib.parse.parse_qsl(qs))
These are two elementary functions. What could go wrong with them? Now let’s test them in test_encoders.py:
# test_encoders.py
from my_encoders import encode_dict_to_querystring, decode_querystring_to_dict
from hypothesis import given, strategies as st
# A strategy for generating dictionaries with simple text keys and values
simple_dict_strategy = st.dictionaries(keys=st.text(), values=st.text())
@given(data=simple_dict_strategy)
def test_querystring_roundtrip(data):
"""Property: decoding an encoded dict should yield the original dict."""
encoded = encode_dict_to_querystring(data)
decoded = decode_querystring_to_dict(encoded)
# We have to be careful with types: parse_qsl returns string values
# So we convert our original values to strings for a fair comparison
original_as_str = {k: str(v) for k, v in data.items()}
    assert decoded == original_as_str
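We can push the encoders harder with a second test that uses st.recursive, which lets Hypothesis nest dictionaries inside dictionaries to arbitrary depth. This is the test whose failure shows up in the run below.

@given(data=st.recursive(
    # Base case: a flat dictionary of text keys and simple values (text or integers).
    st.dictionaries(st.text(), st.integers() | st.text()),
    # Recursive step: allow values to be dictionaries themselves.
    lambda children: st.dictionaries(st.text(), children)
))
def test_for_nesting_limitation(data):
    """
    This test asserts that the decoded data structure matches the original.
    It will fail because urlencode flattens nested structures.
    """
    encoded = encode_dict_to_querystring(data)
    decoded = decode_querystring_to_dict(encoded)
    # This deliberately simple assertion will fail for nested dictionaries,
    # because `decoded` will hold a stringified inner dict while `data` holds
    # a real inner dict. This is how we reveal the bug.
    assert decoded == data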
Now we can run the tests.
(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_encoders.py
=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item
test_encoders.py F
================================================= FAILURES =================================================
_______________________________________ test_for_nesting_limitation ________________________________________
@given(data=st.recursive(
> # Base case: A flat dictionary of text keys and simple values (text or integers).
^^^
st.dictionaries(st.text(), st.integers() | st.text()),
# Recursive step: Allow values to be dictionaries themselves.
lambda children: st.dictionaries(st.text(), children)
))
test_encoders.py:7:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = {'': {}}
@given(data=st.recursive(
# Base case: A flat dictionary of text keys and simple values (text or integers).
st.dictionaries(st.text(), st.integers() | st.text()),
# Recursive step: Allow values to be dictionaries themselves.
lambda children: st.dictionaries(st.text(), children)
))
def test_for_nesting_limitation(data):
"""
This test asserts that the decoded data structure matches the original.
It will fail because urlencode flattens nested structures.
"""
encoded = encode_dict_to_querystring(data)
decoded = decode_querystring_to_dict(encoded)
# This is a deliberately simple assertion. It will fail for nested
# dictionaries because the `decoded` version will have a stringified
# inner dict, while the `data` version will have a true inner dict.
# This is how we reveal the bug.
> assert decoded == data
E AssertionError: assert {'': '{}'} == {'': {}}
E
E Differing items:
E {'': '{}'} != {'': {}}
E Use -v to get more diff
E Falsifying example: test_for_nesting_limitation(
E data={'': {}},
E )
test_encoders.py:24: AssertionError
========================================= short test summary info ==========================================
FAILED test_encoders.py::test_for_nesting_limitation - AssertionError: assert {'': '{}'} == {'': {}}
Ok, that was unexpected. Let’s try to decipher what went wrong with this test. The TL;DR is that this test shows the encode/decode functions do not work correctly for nested dictionaries.
- The Falsifying Example. The most important clue is at the very bottom. Hypothesis is telling us the exact input that breaks the code.
test_for_nesting_limitation(
data={'': {}},
)
- The input is a dictionary where the key is an empty string and the value is an empty dictionary. This is a classic edge case that a human might overlook.
- The Assertion Error: The test failed because of a failed assert statement:
AssertionError: assert {'': '{}'} == {'': {}}
This is the core of the issue. The original data that went into the test was {'': {}}. The decoded result that came out of our functions was {'': '{}'}. This shows that for the key '', the values are different:
- In decoded, the value is the string '{}'.
- In data, the value is the dictionary {}.
A string is not equal to a dictionary, so the assertion assert decoded == data is False, and the test fails.
Tracing the Bug Step-by-Step
Our encode_dict_to_querystring function uses urllib.parse.urlencode. When urlencode sees a value that is a dictionary (like {}), it doesn't know how to handle it, so it just converts it to its string representation ('{}').
The information about the value’s original type (that it was a dict) is lost forever.
When the decode_querystring_to_dict function reads the data back, it correctly decodes the value as the string '{}'. It has no way of knowing it was initially a dictionary.
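You can see this flattening in isolation with the standard library (a small standalone check, separate from the test suite):

import urllib.parse

# urlencode stringifies the nested dict value {} before percent-encoding it...
assert urllib.parse.urlencode({'': {}}) == '=%7B%7D'
# ...so parse_qsl can only ever hand back the string '{}', never a dict.
assert urllib.parse.parse_qsl('=%7B%7D') == [('', '{}')]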
The Solution: Encode Nested Values as JSON Strings
The solution is simple:
- Encode. Before URL-encoding, check each value in your dictionary. If a value is a dict or a list, convert it into a JSON string first.
- Decode. After URL-decoding, check each value. If a value looks like a JSON string (e.g., starts with { or [), parse it back into a Python object.
- Make our testing more comprehensive. The given decorator in the updated tests uses st.recursive which, in simple terms, tells Hypothesis to generate dictionaries that can contain other dictionaries as values, allowing nested data structures of any depth. For example:
- A simple, flat dictionary: {'name': 'Alice', 'city': 'London'}
- A one-level nested dictionary: {'user': {'id': '123', 'name': 'Tom'}}
- A two-level nested dictionary: {'config': {'database': {'host': 'localhost'}}}
- And so on…
Here is the fixed code.
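First, here is one possible fix for my_encoders.py. This is a sketch: JSON-encoding nested values and passing keep_blank_values=True are my choices, and a plain string value that happens to look like JSON would still be ambiguous.

# my_encoders.py (sketch of a possible fix)
import json
import urllib.parse

def encode_dict_to_querystring(data: dict) -> str:
    # JSON-encode nested containers so their structure survives the round trip.
    flat = {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in data.items()
    }
    return urllib.parse.urlencode(flat)

def decode_querystring_to_dict(qs: str) -> dict:
    decoded = {}
    # keep_blank_values=True preserves keys whose value is an empty string.
    for key, value in urllib.parse.parse_qsl(qs, keep_blank_values=True):
        if value.startswith(("{", "[")):
            # Anything that looks like JSON is parsed back into a Python object.
            try:
                decoded[key] = json.loads(value)
            except json.JSONDecodeError:
                decoded[key] = value
        else:
            decoded[key] = value
    return decoded

And here are the updated tests.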
# test_encoders.py
from my_encoders import encode_dict_to_querystring, decode_querystring_to_dict
from hypothesis import given, strategies as st
# =========================================================================
# TEST 1: This test proves that the NESTING logic is correct.
# It uses a strategy that ONLY generates strings, so we don't have to
# worry about type conversion. This test will PASS.
# =========================================================================
@given(data=st.recursive(
st.dictionaries(st.text(), st.text()),
lambda children: st.dictionaries(st.text(), children)
))
def test_roundtrip_preserves_nested_structure(data):
"""Property: The encode/decode round-trip should preserve nested structures."""
encoded = encode_dict_to_querystring(data)
decoded = decode_querystring_to_dict(encoded)
assert decoded == data
# =========================================================================
# TEST 2: This test proves that the TYPE CONVERSION logic is correct
# for simple, FLAT dictionaries. This test will also PASS.
# =========================================================================
@given(data=st.dictionaries(st.text(), st.integers() | st.text()))
def test_roundtrip_stringifies_simple_values(data):
"""
Property: The round-trip should convert simple values (like ints)
to strings.
"""
encoded = encode_dict_to_querystring(data)
decoded = decode_querystring_to_dict(encoded)
# Create the model of what we expect: a dictionary with stringified values.
expected_data = {k: str(v) for k, v in data.items()}
assert decoded == expected_data
Now, if we rerun our tests, we get this:
(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest
=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item
test_encoders.py . [100%]
============================================ 1 passed in 0.16s =============================================
What we worked through there is a classic example of how useful testing with Hypothesis can be. What we thought were two simple, error-free functions turned out to be anything but.
Code Example 3 — Building a Custom Strategy for a Pydantic Model
Many real-world functions don’t just take simple dictionaries; they take structured objects like Pydantic models. Hypothesis can build strategies for these custom types, too.
Let’s define a model in my_models.py.
# my_models.py
from pydantic import BaseModel, Field
from typing import List
class Product(BaseModel):
id: int = Field(gt=0)
name: str = Field(min_length=1)
tags: List[str]
def calculate_shipping_cost(product: Product, weight_kg: float) -> float:
# A buggy shipping cost calculator
cost = 10.0 + (weight_kg * 1.5)
if "fragile" in product.tags:
cost *= 1.5 # Extra cost for fragile items
if weight_kg > 10:
cost += 20 # Surcharge for heavy items
# Bug: what if cost is negative?
return cost
Now, in test_shipping.py, we’ll build a strategy to generate Product instances and test our buggy function.
# test_shipping.py
from my_models import Product, calculate_shipping_cost
from hypothesis import given, strategies as st
# Build a strategy for our Product model
product_strategy = st.builds(
Product,
id=st.integers(min_value=1),
name=st.text(min_size=1),
tags=st.lists(st.sampled_from(["electronics", "books", "fragile", "clothing"]))
)
@given(
product=product_strategy,
weight_kg=st.floats(min_value=-10, max_value=100, allow_nan=False, allow_infinity=False)
)
def test_shipping_cost_is_always_positive(product, weight_kg):
"""Property: The shipping cost should never be negative."""
cost = calculate_shipping_cost(product, weight_kg)
assert cost >= 0
And the test output?
(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_shipping.py
========================================================= test session starts ==========================================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item
test_shipping.py F
=============================================================== FAILURES ===============================================================
________________________________________________ test_shipping_cost_is_always_positive _________________________________________________
@given(
> product=product_strategy,
^^^
weight_kg=st.floats(min_value=-10, max_value=100, allow_nan=False, allow_infinity=False)
)
test_shipping.py:13:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
product = Product(id=1, name='0', tags=[]), weight_kg = -7.0
@given(
product=product_strategy,
weight_kg=st.floats(min_value=-10, max_value=100, allow_nan=False, allow_infinity=False)
)
def test_shipping_cost_is_always_positive(product, weight_kg):
"""Property: The shipping cost should never be negative."""
cost = calculate_shipping_cost(product, weight_kg)
> assert cost >= 0
E assert -0.5 >= 0
E Falsifying example: test_shipping_cost_is_always_positive(
E product=Product(id=1, name='0', tags=[]),
E weight_kg=-7.0,
E )
test_shipping.py:19: AssertionError
======================================================= short test summary info ========================================================
FAILED test_shipping.py::test_shipping_cost_is_always_positive - assert -0.5 >= 0
========================================================== 1 failed in 0.12s ===========================================================
When you run this with pytest, Hypothesis will quickly find a falsifying example: a product with a negative weight_kg can result in a negative shipping cost. This is an edge case we might not have considered, but Hypothesis found it automatically.
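A reasonable fix is to reject non-positive weights up front and then state that requirement as its own property. The sketch below is mine: the guard, the test name, and the strategy bounds are assumptions, and the existing positivity test would also need its weight_kg strategy restricted to positive values once negative weights raise an error.

# my_models.py (sketch of a possible fix)
def calculate_shipping_cost(product: Product, weight_kg: float) -> float:
    # Reject impossible weights up front so the cost can never go negative.
    if weight_kg <= 0:
        raise ValueError("weight_kg must be positive")
    cost = 10.0 + (weight_kg * 1.5)
    if "fragile" in product.tags:
        cost *= 1.5  # Extra cost for fragile items
    if weight_kg > 10:
        cost += 20  # Surcharge for heavy items
    return cost

# test_shipping.py (additional property)
import pytest

@given(
    product=product_strategy,
    weight_kg=st.floats(max_value=0, allow_nan=False, allow_infinity=False)
)
def test_non_positive_weights_are_rejected(product, weight_kg):
    """Property: a weight of zero or less always raises a ValueError."""
    with pytest.raises(ValueError):
        calculate_shipping_cost(product, weight_kg)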
Code Example 4 — Testing Stateful Classes
Hypothesis can do more than test pure functions. It can test classes with internal state by generating sequences of method calls to try to break them. Let’s test a simple custom LimitedCache class.
my_cache.py
# my_cache.py
class LimitedCache:
def __init__(self, capacity: int):
if capacity <= 0:
raise ValueError("Capacity must be positive")
self._cache = {}
self._capacity = capacity
# Bug: This should probably be a deque or ordered dict for proper LRU
self._keys_in_order = []
def put(self, key, value):
if key not in self._cache and len(self._cache) >= self._capacity:
# Evict the oldest item
key_to_evict = self._keys_in_order.pop(0)
del self._cache[key_to_evict]
if key not in self._keys_in_order:
self._keys_in_order.append(key)
self._cache[key] = value
def get(self, key):
return self._cache.get(key)
@property
def size(self):
return len(self._cache)
This cache has several potential bugs related to its eviction policy. Let’s test it using a Hypothesis Rule-Based State Machine, which is designed for testing objects with internal state by generating random sequences of method calls to identify bugs that only appear after specific interactions.
Create the file test_cache.py.
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, precondition
from my_cache import LimitedCache
class CacheMachine(RuleBasedStateMachine):
def __init__(self):
super().__init__()
self.cache = LimitedCache(capacity=3)
# This rule adds 3 initial items to fill the cache
@rule(
k1=st.just('a'), k2=st.just('b'), k3=st.just('c'),
v1=st.integers(), v2=st.integers(), v3=st.integers()
)
def fill_cache(self, k1, v1, k2, v2, k3, v3):
self.cache.put(k1, v1)
self.cache.put(k2, v2)
self.cache.put(k3, v3)
# This rule can only run AFTER the cache has been filled.
# It tests the core logic of LRU vs FIFO.
@precondition(lambda self: self.cache.size == 3)
@rule()
def test_update_behavior(self):
"""
Property: Updating the oldest item ('a') should make it the newest,
so the next eviction should remove the second-oldest item ('b').
Our buggy FIFO cache will incorrectly remove 'a' anyway.
"""
# At this point, keys_in_order is ['a', 'b', 'c'].
# 'a' is the oldest.
# We "use" 'a' again by updating it. In a proper LRU cache,
# this would make 'a' the most recently used item.
self.cache.put('a', 999)
# Now, we add a new key, which should force an eviction.
self.cache.put('d', 4)
# A correct LRU cache would evict 'b'.
# Our buggy FIFO cache will evict 'a'.
# This assertion checks the state of 'a'.
# In our buggy cache, get('a') will be None, so this will fail.
assert self.cache.get('a') is not None, "Item 'a' was incorrectly evicted"
# This tells pytest to run the state machine test
TestCache = CacheMachine.TestCase
Hypothesis will drive the state machine through sequences of rule calls. Here, it fills the cache and then exercises the update-then-evict scenario, quickly demonstrating that the eviction behaviour differs from that of a proper LRU cache and revealing the bug in our implementation.
(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_cache.py
========================================================= test session starts ==========================================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item
test_cache.py F
=============================================================== FAILURES ===============================================================
__________________________________________________________ TestCache.runTest ___________________________________________________________
self =
def runTest(self):
> run_state_machine_as_test(cls, settings=self.settings)
../hyp-env/lib/python3.11/site-packages/hypothesis/stateful.py:476:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../hyp-env/lib/python3.11/site-packages/hypothesis/stateful.py:258: in run_state_machine_as_test
state_machine_test(state_machine_factory)
../hyp-env/lib/python3.11/site-packages/hypothesis/stateful.py:115: in run_state_machine
@given(st.data())
^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = CacheMachine({})
@precondition(lambda self: self.cache.size == 3)
@rule()
def test_update_behavior(self):
"""
Property: Updating the oldest item ('a') should make it the newest,
so the next eviction should remove the second-oldest item ('b').
Our buggy FIFO cache will incorrectly remove 'a' anyway.
"""
# At this point, keys_in_order is ['a', 'b', 'c'].
# 'a' is the oldest.
# We "use" 'a' again by updating it. In a proper LRU cache,
# this would make 'a' the most recently used item.
self.cache.put('a', 999)
# Now, we add a new key, which should force an eviction.
self.cache.put('d', 4)
# A correct LRU cache would evict 'b'.
# Our buggy FIFO cache will evict 'a'.
# This assertion checks the state of 'a'.
# In our buggy cache, get('a') will be None, so this will fail.
> assert self.cache.get('a') is not None, "Item 'a' was incorrectly evicted"
E AssertionError: Item 'a' was incorrectly evicted
E assert None is not None
E + where None = get('a')
E + where get = .get
E + where = CacheMachine({}).cache
E Falsifying example:
E state = CacheMachine()
E state.fill_cache(k1='a', k2='b', k3='c', v1=0, v2=0, v3=0)
E state.test_update_behavior()
E state.teardown()
test_cache.py:44: AssertionError
======================================================= short test summary info ========================================================
FAILED test_cache.py::TestCache::runTest - AssertionError: Item 'a' was incorrectly evicted
========================================================== 1 failed in 0.20s ===========================================================
The above output highlights a bug in the code. In simple terms, it shows that the cache is not a proper “Least Recently Used” (LRU) cache. It has the following significant flaw:
When you update an item that is already in the cache, the cache fails to remember that it is now the “newest” item. It still treats it as the oldest, so it gets kicked out (evicted) prematurely.
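One way to repair this, keeping the simple list-based bookkeeping rather than switching to an OrderedDict or deque, is to move a key to the back of the order list whenever it is written again. The sketch below only changes put(); a full LRU cache would also refresh an item’s position in get().

# my_cache.py (sketch of a possible fix; only put() changes)
    def put(self, key, value):
        if key in self._cache:
            # Re-inserting an existing key makes it the most recently used again.
            self._keys_in_order.remove(key)
        elif len(self._cache) >= self._capacity:
            # Evict the oldest remaining key to make room for a genuinely new one.
            key_to_evict = self._keys_in_order.pop(0)
            del self._cache[key_to_evict]
        self._keys_in_order.append(key)
        self._cache[key] = value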
Code Example 5 — Testing Against a Simpler Reference Implementation
For our final example, we’ll look at a typical situation: a new function is written to replace an older, slower, but otherwise perfectly correct one. The new function must produce the same outputs as the old one for the same inputs. Hypothesis makes this kind of equivalence testing much easier.
Let’s say we have a simple function, sum_list_simple, and a new, “optimised” sum_list_fast that has a bug.
my_sums.py
# my_sums.py
def sum_list_simple(data: list[int]) -> int:
# This is our simple, correct reference implementation
return sum(data)
def sum_list_fast(data: list[int]) -> int:
# A new "fast" implementation with a bug (e.g., integer overflow for large numbers)
# or in this case, a simple mistake.
total = 0
for x in data:
# Bug: This should be +=
total = x
return total
test_my_sums.py
# test_my_sums.py
from my_sums import sum_list_simple, sum_list_fast
from hypothesis import given, strategies as st
@given(st.lists(st.integers()))
def test_fast_sum_matches_simple_sum(data):
"""
Property: The result of the new, fast function should always match
the result of the simple, reference function.
"""
assert sum_list_fast(data) == sum_list_simple(data)
Hypothesis will quickly find an input list for which the two functions disagree. Let’s check it out.
(hyp-env) (base) tom@tpr-desktop:~/hypothesis_project$ pytest -s test_my_sums.py
=========================================== test session starts ============================================
platform linux -- Python 3.11.10, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/tom/hypothesis_project
plugins: hypothesis-6.135.9, anyio-4.9.0
collected 1 item
test_my_sums.py F
================================================= FAILURES =================================================
_____________________________________ test_fast_sum_matches_simple_sum _____________________________________
@given(st.lists(st.integers()))
> def test_fast_sum_matches_simple_sum(data):
^^^
test_my_sums.py:6:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = [1, 0]
@given(st.lists(st.integers()))
def test_fast_sum_matches_simple_sum(data):
"""
Property: The result of the new, fast function should always match
the result of the simple, reference function.
"""
> assert sum_list_fast(data) == sum_list_simple(data)
E assert 0 == 1
E + where 0 = sum_list_fast([1, 0])
E + and 1 = sum_list_simple([1, 0])
E Falsifying example: test_fast_sum_matches_simple_sum(
E data=[1, 0],
E )
test_my_sums.py:11: AssertionError
========================================= short test summary info ==========================================
FAILED test_my_sums.py::test_fast_sum_matches_simple_sum - assert 0 == 1
============================================ 1 failed in 0.17s =============================================
So, the test failed because the “fast” sum function gave the wrong answer (0) for the input list [1, 0], while the correct answer, provided by the “simple” sum function, was 1. Now that you know the issue, you can take steps to fix it.
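The fix is the one the comment in my_sums.py already hints at: accumulate rather than overwrite.

# my_sums.py (fixed)
def sum_list_fast(data: list[int]) -> int:
    total = 0
    for x in data:
        total += x  # accumulate instead of overwriting
    return total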
Summary
In this article, we took a deep dive into the world of property-based testing with Hypothesis, moving beyond simple examples to show how it can be applied to real-world testing challenges. We saw that by defining the invariants of our code, we can uncover subtle bugs that traditional testing would likely miss. We learned how to:
- Test the “round-trip” property and see how more complex data strategies can reveal limitations in our code.
- Build custom strategies to generate instances of complex Pydantic models for testing business logic.
- Use a RuleBasedStateMachine to test the behaviour of stateful classes by generating sequences of method calls.
- Validate a complex, optimised function by testing it against a more straightforward, known-good reference implementation.
Adding property-based tests to your toolkit won’t replace all your existing tests. Still, it will profoundly augment them, forcing you to think more clearly about your code’s contracts and giving you a much higher degree of confidence in its correctness. I encourage you to pick a function or class in your codebase, think about its fundamental properties, and let Hypothesis try its best to prove you wrong. You’ll be a better developer for it.
I’ve only scratched the surface of what Hypothesis can do for your testing. For more information, refer to their official documentation, available via the link below.
https://hypothesis.readthedocs.io/en/latest



