Sign Up to Our Newsletter

Be the first to know the latest tech updates

[mc4wp_form id=195]

Make Python Up to 150× Faster with C

Make Python Up to 150× Faster with C


, sooner or later, you’re going to hit a block when it comes to your code execution speed. If you’ve ever written a computationally heavy algorithm in Python, such as string distance, matrix math, or cryptographic hashing, you’ll know what I mean.

Sure, there are times when external libraries like NumPy can come to your rescue, but what happens when the algorithm is inherently sequential? That was precisely my problem when I wanted to benchmark a particular algorithm, which determines the number of edits required to transform one string into another.

I tried Python. I tried NumPy. And then I turned to C, a language I first learned at college decades ago but hadn’t used in anger for about 15 years. That’s where things got interesting.

I first had to answer the question, “Can you call C from Python?”. After some research, it quickly became clear that the answer was indeed yes. In fact, it turns out you can do it in several ways, and in this article, I’ll look at three of the most common ways of doing so. 

From easiest to most complex, we’ll look at using,

  • a subprocess
  • ctypes
  • Python C extensions

The algorithm that we’ll be testing with is called the Levenshtein Distance (LD) algorithm. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after Soviet mathematician Vladimir Levenshtein, who defined the metric in 1965. It has applications in various tools, such as spell checkers and optical character recognition systems.

To give you a clearer picture of what we’re talking about, here are a couple of examples. 

Calculate the LD between the words “book” and “black”.

  1. book → baok (substitution of “a” for “o”),
  2. baok → back (substitution of “c” for “o”)
  3. back → black (add the letter “l”)

So, the LD in this case is three.

Calculate the LD between the words “superb” and “super”.

  1. superb → super (remove letter “b”)

The LD in this case is simply one.

We’ll code up the LD algorithm in Python and C, then set up benchmarks to test how long it takes to run it using pure Python code versus the time it takes to run it in C code called from Python. 

Prerequisites

As I was running this on MS Windows, I needed a way of compiling C programs. The easiest way I found to do this was to download the Visual Studio Build tools for 2022. This allows you to compile C programs on the command line.

To install, first go to the main Visual Studio downloads page. On the second screen, you’ll see a search box. Type “Build Tools” into the search field and click Search. The search should return a screen that looks like this,

Image by Author

Click the Download button and follow any installation instructions. Once it’s been installed, in a DOS terminal window, when you click on the little plus button to open a new terminal, you should see an option to open a “Developer command prompt for VS 2022”. 

Image by Author

Most of my Python code will be running on a Jupyter Notebook, so you should set up a new development environment and install Jupyter. Do that now if you want to follow along. I’m using the UV tool for this part, but feel free to use whichever method you’re most comfortable with.

c:\> uv init pythonc
c:\> cd pythonc
c:\> uv venv pythonc
c:\> source pythonc/bin/activate
(pythonc) c:\> uv pip install jupyter

The LD algorithm in C

We need slightly different versions of the LD algorithm in C, depending on the method used to call it. This is the version for our first example, where we use subprocessing to call a C executable.

1/ subprocessing: lev_sub.c

#include 
#include 
#include 

static int levenshtein(const char* a, const char* b) {
    size_t n = strlen(a), m = strlen(b);
    if (n == 0) return (int)m;
    if (m == 0) return (int)n;
    int* prev = (int*)malloc((m + 1) * sizeof(int));
    int* curr = (int*)malloc((m + 1) * sizeof(int));
    if (!prev || !curr) { free(prev); free(curr); return -1; }
    for (size_t j = 0; j <= m; ++j) prev[j] = (int)j;
    for (size_t i = 1; i <= n; ++i) {
        curr[0] = (int)i; char ca = a[i - 1];
        for (size_t j = 1; j <= m; ++j) {
            int cost = (ca == b[j - 1]) ? 0 : 1;
            int del = prev[j] + 1, ins = curr[j - 1] + 1, sub = prev[j - 1] + cost;
            int d = del < ins ? del : ins; curr[j] = d < sub ? d : sub;
        }
        int* tmp = prev; prev = curr; curr = tmp;
    }
    int ans = prev[m]; free(prev); free(curr); return ans;
}

int main(int argc, char** argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s  \n", argv[0]); return 2; }
    int d = levenshtein(argv[1], argv[2]);
    if (d < 0) return 1;
    printf("%d\n", d);
    return 0;
}

To compile this, start a new Developer Command Prompt for VS Code 2022 and type the following to ensure we are optimising the compilation for 64-bit architecture.

(pythonc) c:\> "%VSINSTALLDIR%VC\Auxiliary\Build\vcvarsall.bat" x64

Next, we can compile our C code using this command.

(pythonc) c:\> cl /O2 /Fe:lev_sub.exe lev_sub.c

That will create an executable file.

Benchmarking the subprocessing code

In a Jupyter notebook, type in the following code, which will be common to all our benchmarking. It generates random lowercase strings of length N and calculates the number of edits required to transform string1 into string2.

# Sub-process benchmark
import time, random, string, subprocess
import numpy as np

EXE = r"lev_sub.exe"  

def rnd_ascii(n):
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(n))

def lev_py(a: str, b: str) -> int:
    n, m = len(a), len(b)
    if n == 0: return m
    if m == 0: return n
    prev = list(range(m+1))
    curr = [0]*(m+1)
    for i, ca in enumerate(a, 1):
        curr[0] = i
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1, curr[j-1] + 1, prev[j-1] + cost)
        prev, curr = curr, prev
    return prev[m]

Next is the actual benchmarking code and run results. To run the C part of the code, we spawn a subprocess that executes the compiled C code file we previously created and measures the time it takes to run, comparing it with the pure Python method. We run each method against a 2000 and a 4000 set of random words three times and take the fastest of those times.

def lev_subprocess(a: str, b: str) -> int:
    out = subprocess.check_output([EXE, a, b], text=True)
    return int(out.strip())

def bench(fn, *args, repeat=3, warmup=1):
    for _ in range(warmup): fn(*args)
    best = float("inf"); out_best = None
    for _ in range(repeat):
        t0 = time.perf_counter(); out = fn(*args); dt = time.perf_counter() - t0
        if dt < best: best, out_best = dt, out
    return out_best, best

if __name__ == "__main__":
    cases = [(2000,2000),(4000, 4000)]
    print("Benchmark: Pythonvs C (subprocess)\n")
    for n, m in cases:
        a, b = rnd_ascii(n), rnd_ascii(m)
        py_out, py_t = bench(lev_py, a, b, repeat=3)
        sp_out, sp_t = bench(lev_subprocess, a, b, repeat=3)
        print(f"n={n} m={m}")
        print(f"  Python   : {py_t:.3f}s -> {py_out}")
        print(f"  Subproc  : {sp_t:.3f}s -> {sp_out}\n")

 Here are the results.

Benchmark: Python vs C (subprocess)

n=2000 m=2000 
  Python   : 1.276s -> 1768
  Subproc  : 0.024s -> 1768

n=4000 m=4000 
  Python   : 5.015s -> 3519
  Subproc  : 0.050s -> 3519

That’s a pretty significant improvement in the run-time of C over Python.

2. ctypes: lev.c

ctypes is a foreign function interface (FFI) library built right into Python’s standard library. It lets you load and call functions from shared libraries written in C (DLLs on Windows, .so files on Linux, .dylib on macOS) directly from Python, without needing to write a complete extension module in C.

First, here is our C version of the LD algorithm, utilising ctypes. It’s almost identical to our subprocess C function, with the addition of a line that enables us to use Python to call the DLL after it has been compiled.

/* 
 * lev.c
*/

#include 
#include 

/* below line includes this function in the 
 * DLL's export table so other programs can use it.
 */
__declspec(dllexport)

int levenshtein(const char* a, const char* b) {
    size_t n = strlen(a), m = strlen(b);
    if (n == 0) return (int)m;
    if (m == 0) return (int)n;

    int* prev = (int*)malloc((m + 1) * sizeof(int));
    int* curr = (int*)malloc((m + 1) * sizeof(int));
    if (!prev || !curr) { free(prev); free(curr); return -1; }

    for (size_t j = 0; j <= m; ++j) prev[j] = (int)j;

    for (size_t i = 1; i <= n; ++i) {
        curr[0] = (int)i;
        char ca = a[i - 1];
        for (size_t j = 1; j <= m; ++j) {
            int cost = (ca == b[j - 1]) ? 0 : 1;
            int del = prev[j] + 1;
            int ins = curr[j - 1] + 1;
            int sub = prev[j - 1] + cost;
            int d = del < ins ? del : ins;
            curr[j] = d < sub ? d : sub;
        }
        int* tmp = prev; prev = curr; curr = tmp;
    }
    int ans = prev[m];
    free(prev); free(curr);
    return ans;
}

When using ctypes to call C in Python, we need to convert our C code into a dynamic link library (DLL) rather than an executable. Here is the build command you need for that.

(pythonc) c:\> cl /O2 /LD lev.c /Fe:lev.dll

Benchmarking the ctypes code

I’m omitting the lev_py and rnd_ascii Python functions in this code snippet, as they are identical to those in the previous example. Type this into your notebook.

#ctypes benchmark

import time, random, string, ctypes
import numpy as np

DLL = r"lev.dll"  

levdll = ctypes.CDLL(DLL)
levdll.levenshtein.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
levdll.levenshtein.restype  = ctypes.c_int

def lev_ctypes(a: str, b: str) -> int:
    return int(levdll.levenshtein(a.encode('utf-8'), b.encode('utf-8')))

def bench(fn, *args, repeat=3, warmup=1):
    for _ in range(warmup): fn(*args)
    best = float("inf"); out_best = None
    for _ in range(repeat):
        t0 = time.perf_counter(); out = fn(*args); dt = time.perf_counter() - t0
        if dt < best: best, out_best = dt, out
    return out_best, best

if __name__ == "__main__":
    cases = [(2000,2000),(4000, 4000)]
    print("Benchmark: Python vs NumPy vs C (ctypes)\n")
    for n, m in cases:
        a, b = rnd_ascii(n), rnd_ascii(m)
        py_out, py_t = bench(lev_py, a, b, repeat=3)
        ct_out, ct_t = bench(lev_ctypes, a, b, repeat=3)
        print(f"n={n} m={m}")
        print(f"  Python   : {py_t:.3f}s -> {py_out}")
        print(f"  ctypes   : {ct_t:.3f}s -> {ct_out}\n")

And the results?

Benchmark: Python vs C (ctypes)

n=2000 m=2000  
  Python   : 1.258s -> 1769
  ctypes   : 0.019s -> 1769

n=4000 m=4000 
  Python   : 5.138s -> 3521
  ctypes   : 0.035s -> 3521

We have very similar results to the first example.

3/ Python C extensions: lev_cext.c

When using Python C extensions, there’s slightly more work involved. First, let’s examine the C code. The basic algorithm is unchanged. It’s just that we need to add a bit more scaffolding to enable the code to be called from Python. It uses CPython’s API (Python.h) to parse Python arguments, run the C code, and return the result as a Python integer.

The function levext_lev acts as a wrapper. It parses two string arguments from Python (PyArg_ParseTuple), calls the C function lev_impl to compute the distance, handles memory errors, and returns the result as a Python integer (PyLong_FromLong). The Methods table registers this function under the name “levenshtein”, so it can be called from Python code. Finally, PyInit_levext defines and initialises the module levext, making it importable in Python with the import levext command.

#include 
#include 
#include 

static int lev_impl(const char* a, const char* b) {
    size_t n = strlen(a), m = strlen(b);
    if (n == 0) return (int)m;
    if (m == 0) return (int)n;
    int* prev = (int*)malloc((m + 1) * sizeof(int));
    int* curr = (int*)malloc((m + 1) * sizeof(int));
    if (!prev || !curr) { free(prev); free(curr); return -1; }
    for (size_t j = 0; j <= m; ++j) prev[j] = (int)j;
    for (size_t i = 1; i <= n; ++i) {
        curr[0] = (int)i; char ca = a[i - 1];
        for (size_t j = 1; j <= m; ++j) {
            int cost = (ca == b[j - 1]) ? 0 : 1;
            int del = prev[j] + 1, ins = curr[j - 1] + 1, sub = prev[j - 1] + cost;
            int d = del < ins ? del : ins; curr[j] = d < sub ? d : sub;
        }
        int* tmp = prev; prev = curr; curr = tmp;
    }
    int ans = prev[m]; free(prev); free(curr); return ans;
}

static PyObject* levext_lev(PyObject* self, PyObject* args) {
    const char *a, *b;
    if (!PyArg_ParseTuple(args, "ss", &a, &b)) return NULL;
    int d = lev_impl(a, b);
    if (d < 0) { PyErr_SetString(PyExc_MemoryError, "alloc failed"); return NULL; }
    return PyLong_FromLong(d);
}

static PyMethodDef Methods[] = {
    {"levenshtein", levext_lev, METH_VARARGS, "Levenshtein distance"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef mod = { PyModuleDef_HEAD_INIT, "levext", NULL, -1, Methods };
PyMODINIT_FUNC PyInit_levext(void) { return PyModule_Create(&mod); }

Because we aren’t just building an executable this time but a native Python extension module, we have to build the C code differently.

This type of module must be compiled against Python’s headers and appropriately linked to Python’s runtime so it behaves like a built-in Python module. 

To achieve this, we create a Python module called setup.py, which imports the setuptools library to facilitate this process. It automates:

  • Finding the right include paths for Python.h
  • Passing the correct compiler and linker flags
  • Producing a .pyd file with the right naming convention for your Python version and platform

Doing this by hand with the cl compiler command would be tedious and error-prone, because you’d have to specify all of those paths and flags manually.

Here is the code we need.

from setuptools import setup, Extension
setup(
    name="levext",
    version="0.1.0",
    ext_modules=[Extension("levext", ["lev_cext.c"], extra_compile_args=["/O2"])],
)

We run it using regular Python on the command line, as shown here.

(pythonc) c:\> python setup.py build_ext --inplace

#output
running build_ext
copying build\lib.win-amd64-cpython-312\levext.cp312-win_amd64.pyd ->

Benchmarking the Python C extensions code

Now, here is the Python code to call our C. Again, I have omitted the two Python helper functions that are unchanged from the previous examples.

# c-ext benchmark

import time, random, string
import numpy as np
import levext  # make sure levext.cp312-win_amd64.pyd is built & importable

def lev_extension(a: str, b: str) -> int:
    return levext.levenshtein(a, b)

def bench(fn, *args, repeat=3, warmup=1):
    for _ in range(warmup): fn(*args)
    best = float("inf"); out_best = None
    for _ in range(repeat):
        t0 = time.perf_counter(); out = fn(*args); dt = time.perf_counter() - t0
        if dt < best: best, out_best = dt, out
    return out_best, best

if __name__ == "__main__":
    cases = [(2000, 2000), (4000, 4000)]
    print("Benchmark: Python vs NumPy vs C (C extension)\n")
    for n, m in cases:
        a, b = rnd_ascii(n), rnd_ascii(m)
        py_out, py_t = bench(lev_py, a, b, repeat=3)
        ex_out, ex_t = bench(lev_extension, a, b, repeat=3)
        print(f"n={n} m={m} ")
        print(f"  Python   : {py_t:.3f}s -> {py_out}")
        print(f"  C ext    : {ex_t:.3f}s -> {ex_out}\n")

Here is the output.

Benchmark: Python vs C (C extension)

n=2000 m=2000  
  Python   : 1.204s -> 1768
  C ext    : 0.010s -> 1768

n=4000 m=4000  
  Python   : 5.039s -> 3526
  C ext    : 0.033s -> 3526

So this gave the fastest results. The C version is showing up as over 150 times faster than pure Python in the second test case above. 

Not too shabby.

But what about NumPy?

Some of you may be wondering why NumPy wasn’t used. Well, NumPy is fantastic for vectorised numeric array operations, such as dot products, but not all algorithms map cleanly to vectorisation. Calculating Levenshtein distances is an inherently sequential process, so NumPy doesn’t provide much help. In those cases, dropping into C via subprocess, ctypes, or a native C extension provides real runtime speedups while still being callable from Python.

PS. I ran some additional tests using code that was amenable to using NumPy, and it was no surprise that NumPy was as fast as the called C code. That’s to be expected as NumPy uses C under the hood and has many years of development and optimisation behind it.

Summary

The article explores how Python developers can overcome performance bottlenecks in computationally intensive tasks, such as calculating the Levenshtein distance — a string similarity algorithm —by integrating C code into Python. While libraries like NumPy accelerate vectorised operations, sequential algorithms like Levenshtein often remain impervious to NumPy’s optimisations.

To address this, I demonstrated three integration patterns, ranging from simplest to most advanced, that allow you to call fast C code from Python.

Subprocess. Compile the C code into an executable (e.g., with gcc or Visual Studio Build Tools) and run it from Python using the subprocess module. This is easy to set up and already shows a huge speedup compared to pure Python.

ctypes. Using ctypes lets Python directly load and call functions from C shared libraries without needing to write a full Python extension module. This makes it much simpler and faster to integrate performance-critical C code into Python, avoiding the overhead of running external processes while still keeping your code mostly in Python.

Python C Extensions. Write a full Python extension in C using the CPython API (python.h). This requires more setup but offers the fastest performance and smoothest integration, allowing you to call C functions as if they were native Python functions.

The benchmarks show that C implementations of the Levenshtein algorithm run over 100× faster than pure Python. While an external library such as NumPy excels at vectorised numeric operations, it doesn’t significantly improve performance for inherently sequential algorithms like Levenshtein, making C integration a better choice in such cases.

If you hit performance limits in Python, offloading heavy computation to C can provide massive speed improvements and is worth considering. You can start simple with subprocess, then move to ctypes or full C extensions for tighter integration and better performance.

I’ve only outlined three of the most popular ways to integrate C code with Python, but there are a few other methods which I recommend you read up on if this topic interests you.



Source link

Thomas Reid

About Author

TechToday Logo

Your go-to destination for the latest in tech, AI breakthroughs, industry trends, and expert insights.

Get Latest Updates and big deals

Our expertise, as well as our passion for web design, sets us apart from other agencies.

Digitally Interactive  Copyright 2022-25 All Rights Reserved.