I mentioned in my previous post a lock contention issue when accessing the same list from different threads, leading to worse performance the more threads were used. Turns out this is fixed in Python 3.14.0a7 already!
Unfortunately this version doesn't (yet) seem available from uv (which is why I hadn't originally tried it), but it can be downloaded and installed manually from the Python website (click Customize at the installation step and check the free-threading box). On macOS, the interpreter binaries can then be found at:
/Library/Frameworks/Python.framework/Versions/3.14/bin/python3
/Library/Frameworks/PythonT.framework/Versions/3.14/bin/python3t
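As a quick sanity check that the free-threaded build is really the one running, something along these lines works (a minimal sketch; the Py_GIL_DISABLED build variable and sys._is_gil_enabled() are the relevant introspection hooks in 3.13+):

import sys
import sysconfig

# Py_GIL_DISABLED is 1 on free-threaded builds
print("free-threaded build:", sysconfig.get_config_var("Py_GIL_DISABLED") == 1)

# On free-threaded builds this reports whether the GIL is currently active
# (it can still be re-enabled at runtime, e.g. with PYTHON_GIL=1)
if hasattr(sys, "_is_gil_enabled"):
    print("GIL enabled:", sys._is_gil_enabled())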
With this version we observe much more reasonable scaling out of the box across all three benchmarks:
Transpose
During all this benchmarking I couldn't help but notice that the matrix multiplication is quite naive; in more realistic code we would at the very least transpose B before multiplying. This gets us sequential memory access in the inner loop:
from concurrent.futures import ThreadPoolExecutor


def matrix_multiply_row(args):
    """Multiply one row of matrix A with B (for threading)"""
    i, A, B_transpose = args
    n = len(B_transpose)
    row = [0] * n
    for j in range(n):
        A_row = A[i]
        B_row = B_transpose[j]
        total = 0
        for k in range(n):
            total += A_row[k] * B_row[k]
        row[j] = total
    return i, row


def matrix_multiply(A, B, num_threads=1):
    """Multiply two matrices using specified number of threads"""
    n = len(A)
    C = [[0 for _ in range(n)] for _ in range(n)]
    B_transpose = [[0 for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            B_transpose[j][i] = B[i][j]

    # Prepare row tasks
    tasks = [(i, A, B_transpose) for i in range(n)]

    # Execute with ThreadPoolExecutor - works for both sequential and parallel
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        for i, row in executor.map(matrix_multiply_row, tasks):
            C[i] = row

    return C
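To reproduce the scaling measurements, a minimal timing harness along these lines is enough (the matrix size and thread counts here are placeholders, not necessarily what was used for the plots):

import random
import time

def benchmark(n=256, thread_counts=(1, 2, 4, 8)):
    """Time matrix_multiply for a few thread counts."""
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    for t in thread_counts:
        start = time.perf_counter()
        matrix_multiply(A, B, num_threads=t)
        print(f"{t} thread(s): {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    benchmark()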
Aside:
total = 0
for k in range(n):
    total += A_row[k] * B_row[k]
row[j] = total
might seem like a tempting place to use sum and zip:
row[j] = sum(a * b for a, b in zip(A_row, B_row))
but this is actually much slower, probably because zip has to create temporary tuple instances.
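This is easy to check with a quick micro-benchmark; a rough sketch (the row length and repeat count are arbitrary) looks like:

import random
import timeit

A_row = [random.random() for _ in range(1000)]
B_row = [random.random() for _ in range(1000)]
n = len(A_row)

def explicit_loop():
    total = 0
    for k in range(n):
        total += A_row[k] * B_row[k]
    return total

def sum_zip():
    return sum(a * b for a, b in zip(A_row, B_row))

print("explicit loop:", timeit.timeit(explicit_loop, number=10_000))
print("sum + zip:    ", timeit.timeit(sum_zip, number=10_000))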
Overall the transpose trick gives us a roughly 2x speed-up, and somewhat better scaling behaviour:
Concurrent Reads
While investigating this issue I started benchmarking concurrent reads on various data structures. The good news is that in alpha 7, most containers (list, tuple, dict, str, bytes) can now be read quickly from multiple threads; only bytearray remains slow.
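The shape of such a benchmark is simply many threads walking the same shared container at once; a simplified sketch of the idea (sizes and thread counts are arbitrary, and this is not the exact code from the gist) looks like:

import time
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))  # shared container, only ever read

def read_all(_):
    """Walk the shared list without modifying it."""
    total = 0
    for x in data:
        total += x
    return total

def bench(num_threads):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        list(executor.map(read_all, range(num_threads)))
    return time.perf_counter() - start

for t in (1, 2, 4, 8):
    print(f"{t} thread(s): {bench(t):.2f}s")

Since every thread does the same amount of work, good scaling shows up as roughly constant wall time as the thread count grows.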
If you are curious, there's a gist with the full benchmark and results.
Tags: python, programming, threading