Understanding concurrency and parallelism proved crucial when I had to work around the problems caused by the single-threaded, synchronous nature of Python. This article explains my understanding of both concepts and shows, with benchmarks, when to reach for each.
What is concurrency?
Concurrency is an approach for handling multiple operations whose lifetimes overlap each other, though not necessarily running simultaneously (that is parallelism). There are two flavors: preemptive and cooperative.

Preemptive multitasking relies on the OS to switch between threads (a context switch). The OS uses an interrupt mechanism to suspend a running operation and schedule which operation executes next at any given time, making sure every task gets a share of CPU time. E.g. ThreadPoolExecutor.

Cooperative multitasking, the asynchronous approach, does not rely on OS threads for scheduling. Instead, context switches happen inside the language runtime at points marked explicitly with the async/await syntax, and everything can run in a single thread. E.g. asyncio.
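As a minimal sketch of cooperative switching (my own illustrative example, not one of the benchmarks below), two coroutines interleave in a single thread, and control only changes hands at an await:

```python
import asyncio


async def worker(name):
    for step in range(2):
        print(f"{name}: step {step}")
        # The only place this coroutine yields control: the event loop
        # switches to another ready task while this one "waits".
        await asyncio.sleep(0.1)


async def main():
    # Both coroutines interleave in a single thread: A:0, B:0, A:1, B:1.
    await asyncio.gather(worker("A"), worker("B"))


asyncio.run(main())
```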
What is parallelism?
Parallelism in Python is best suited for CPU-bound tasks. It leverages multiple CPU cores to execute multiple processes separately and simultaneously. Each worker process gets its own copy of the Python interpreter, with its own GIL and memory space, so the processes can truly run at the same time, achieving true parallelism. E.g. ProcessPoolExecutor.
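A quick way to see those separate interpreters (an illustrative sketch; the worker and task counts are arbitrary) is to print the process ID inside each worker:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def whoami(task_id):
    # Each worker runs in its own process, hence its own interpreter and GIL.
    return f"task {task_id} ran in PID {os.getpid()}"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=3) as executor:
        for line in executor.map(whoami, range(6)):
            print(line)
```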
Performing an I/O-Bound Workload
Let's use an I/O-bound network example, downloading pages from a few sites, and compare synchronous (non-concurrent), multi-threading (concurrent), asynchronous I/O (concurrent), and multi-processing (parallel) versions to see how quickly each program runs.
Synchronous version:
```python
import time

import requests


def main():
    sites = [
        "https://www.python.org",
        "https://www.docker.com",
        "https://github.com",
    ] * 100
    start_time = time.perf_counter()
    with requests.Session() as session:
        for site in sites:
            get_site(session, site)
    duration = time.perf_counter() - start_time
    print(f"Done retrieving {len(sites)} sites in {duration} seconds.")


def get_site(session, site):
    response = session.get(site)
    print(f"Got {len(response.content)} bytes from {site}.")


if __name__ == "__main__":
    main()
```
Output:
```
(venv) % python io_sync.py
Got 50282 bytes from https://www.python.org.
Got 366082 bytes from https://www.docker.com.
Got 563573 bytes from https://github.com.
...
Done retrieving 300 sites in 32.525531874998705 seconds.
```
It took about 32 seconds because the single-threaded program waits for each network request to finish before moving on to the next one.
Multi-threading version:
```python
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import requests


def main():
    sites = [
        "https://www.python.org",
        "https://www.docker.com",
        "https://github.com",
    ] * 100
    start_time = time.perf_counter()
    with requests.Session() as session:
        with ThreadPoolExecutor(max_workers=3) as executor:
            get_site_with_session = partial(get_site, session)
            # Consume the iterator so any exception raised in a worker
            # thread surfaces here instead of being silently discarded.
            list(executor.map(get_site_with_session, sites))
    duration = time.perf_counter() - start_time
    print(f"Done retrieving {len(sites)} sites in {duration} seconds.")


def get_site(session, site):
    response = session.get(site)
    print(f"Got {len(response.content)} bytes from {site}.")


if __name__ == "__main__":
    main()
```
Output:
```
(venv) % python io_multi_threading.py
Got 50282 bytes from https://www.python.org.
Got 366082 bytes from https://www.docker.com.
Got 563597 bytes from https://github.com.
...
Done retrieving 300 sites in 6.31756170799963 seconds.
```
There is a huge performance improvement when using ThreadPoolExecutor because it allows multiple threads to perform network requests concurrently.
Network operations are I/O-bound, meaning each thread often spends time waiting for a response from the server. While one thread is blocked waiting for data, the other threads can continue sending or receiving requests.
This overlap of waiting periods lets multiple requests progress simultaneously, instead of waiting for one to finish before starting the next. As a result, total execution time decreases significantly.
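A stripped-down illustration of that overlap (my own sketch, using time.sleep as a stand-in for a blocking network call, since sleep releases the GIL the same way a blocking socket read does):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i):
    # While this thread sleeps (like waiting on a server response),
    # the other worker threads keep making progress.
    print(f"{threading.current_thread().name} starts request {i}")
    time.sleep(1)
    print(f"{threading.current_thread().name} finished request {i}")


with ThreadPoolExecutor(max_workers=3) as executor:
    # Three 1-second "requests" finish in about 1 second total, not 3.
    executor.map(fake_request, range(3))
```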
Asynchronous I/O version:
```python
import asyncio
import time

import aiohttp


async def main():
    sites = [
        "https://www.python.org",
        "https://www.docker.com",
        "https://github.com",
    ] * 100
    start_time = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        tasks = [get_site(session, site) for site in sites]
        await asyncio.gather(*tasks, return_exceptions=True)
    duration = time.perf_counter() - start_time
    print(f"Done retrieving {len(sites)} sites in {duration} seconds.")


async def get_site(session, site):
    async with session.get(site) as response:
        print(f"Got {len(await response.read())} bytes from {site}.")


if __name__ == "__main__":
    asyncio.run(main())
```
Output:
```
(venv) % python io_asyncio.py
Got 563573 bytes from https://github.com.
Got 563573 bytes from https://github.com.
Got 563573 bytes from https://github.com.
...
Done retrieving 300 sites in 3.654484542001228 seconds.
```
It's faster than the multi-threading version and roughly 9× faster than the synchronous version (3.65 vs. 32.5 seconds).
Like multithreading, asyncio allows multiple network requests to run concurrently. However, instead of using multiple OS threads, asyncio runs all tasks in a single thread using an event loop.
When one task is waiting for a network response, the event loop automatically switches to another task that's ready to run. This avoids the overhead of creating or switching between threads, making async I/O more lightweight and efficient for I/O-bound operations.
Multi-processing version:
```python
import time
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import requests


def main():
    sites = [
        "https://www.python.org",
        "https://www.docker.com",
        "https://github.com",
    ] * 100
    start_time = time.perf_counter()
    with requests.Session() as session:
        with ProcessPoolExecutor(max_workers=3) as executor:
            # The session is pickled and sent to each worker process,
            # so the workers do not actually share one connection pool.
            get_site_with_session = partial(get_site, session)
            list(executor.map(get_site_with_session, sites))
    duration = time.perf_counter() - start_time
    print(f"Done retrieving {len(sites)} sites in {duration} seconds.")


def get_site(session, site):
    response = session.get(site)
    print(f"Got {len(response.content)} bytes from {site}.")


if __name__ == "__main__":
    main()
```
Output:
```
(venv) % python io_multi_processing.py
Got 50282 bytes from https://www.python.org.
Got 301749 bytes from https://www.docker.com.
Got 563573 bytes from https://github.com.
...
Done retrieving 300 sites in 22.975569459000326 seconds.
```
It's faster than the synchronous version but not as efficient as the multi-threaded or async I/O versions for network-bound tasks.
It creates multiple separate processes running in parallel, but within each process the requests still run sequentially, just like the synchronous version, and we pay the extra overhead of creating and managing the processes and serializing data between them. That's why it's slower than multi-threading or asyncio here.
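To get a feel for that overhead (a rough sketch; the absolute numbers vary by OS and Python version), time how long each pool takes just to run a few trivial tasks:

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def noop(_):
    return None


if __name__ == "__main__":
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls(max_workers=3) as executor:
            # Threads start almost instantly; spawning interpreter
            # processes typically takes noticeably longer.
            list(executor.map(noop, range(3)))
        print(f"{pool_cls.__name__}: {time.perf_counter() - start:.3f}s")
```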
Performing a CPU-Bound Workload
This time, we use a function that checks whether large numbers are prime to simulate a CPU-bound workload, and we try every concurrency and parallelism approach, mirroring the I/O network examples above.
Synchronous version:
```python
import math
import time


def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True


def main():
    PRIMES = [
        112272535095293,
        112582705942171,
        112272535095293,
        115280095190773,
        115797848077099,
        1099726899285419,
    ] * 100
    start_time = time.perf_counter()
    for prime in PRIMES:
        is_prime(prime)
    duration = time.perf_counter() - start_time
    print(f"Done in {duration} seconds.")


if __name__ == "__main__":
    main()
```
Output:
```
(venv) % python cpu_sync.py
Done in 97.74420645899954 seconds.
```
So here we have the baseline duration from the synchronous version. Let's try multi-threading to see if there is an improvement.
Multi-threading version:
```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True


def main():
    PRIMES = [
        112272535095293,
        112582705942171,
        112272535095293,
        115280095190773,
        115797848077099,
        1099726899285419,
    ] * 100
    start_time = time.perf_counter()
    with ThreadPoolExecutor(max_workers=3) as executor:
        executor.map(is_prime, PRIMES)
    duration = time.perf_counter() - start_time
    print(f"Done in {duration} seconds.")


if __name__ == "__main__":
    main()
```
Output:
```
(venv) % python cpu_multi_threading.py
Done in 100.03000933300063 seconds.
```
The run was actually slightly slower with ThreadPoolExecutor than the synchronous version.
This is because CPU-bound tasks never wait on external resources, so there is no waiting to overlap; instead, the threads compete to acquire the GIL, and only one of them can execute Python bytecode at a time.
Additionally, creating and managing threads introduces context-switching overhead, which can make threaded code slower than simple sequential execution.
The same can be expected for the asynchronous I/O flow.
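One detail behind that context switching: CPython periodically forces the running thread to release the GIL so another thread can be scheduled, and the interval is exposed via sys:

```python
import sys

# How often CPython asks the running thread to release the GIL so
# another thread can be scheduled (the default is 0.005 seconds).
print(sys.getswitchinterval())
```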
Asynchronous I/O version:
```python
import asyncio
import math
import time


async def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True


async def main():
    PRIMES = [
        112272535095293,
        112582705942171,
        112272535095293,
        115280095190773,
        115797848077099,
        1099726899285419,
    ] * 100
    start_time = time.perf_counter()
    tasks = [is_prime(number) for number in PRIMES]
    await asyncio.gather(*tasks, return_exceptions=True)
    duration = time.perf_counter() - start_time
    print(f"Done in {duration} seconds.")


if __name__ == "__main__":
    asyncio.run(main())
```
Output:
```
(venv) % python cpu_asyncio.py
Done in 100.84175841700016 seconds.
```
Since there are no network I/O requests to wait on, the async coroutines basically run sequentially within the event loop, with the extra overhead of the loop itself and unnecessary context switching, which makes this slower than the synchronous version.
In theory, it could be the least efficient approach of all if is_prime involved recursive awaited calls, multiplying the async/await switching overhead. asyncio was designed for I/O-bound workloads, not CPU-bound ones.
Let's try the multi-processing approach.
Multi-processing version:
```python
import math
import time
from concurrent.futures import ProcessPoolExecutor


def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True


def main():
    PRIMES = [
        112272535095293,
        112582705942171,
        112272535095293,
        115280095190773,
        115797848077099,
        1099726899285419,
    ] * 100
    start_time = time.perf_counter()
    with ProcessPoolExecutor(max_workers=3) as executor:
        executor.map(is_prime, PRIMES)
    duration = time.perf_counter() - start_time
    print(f"Done in {duration} seconds.")


if __name__ == "__main__":
    main()
```
Output:
```
(venv) % python cpu_multi_processing.py
Done in 35.96208695899986 seconds.
```
For CPU-bound tasks, we can see that multiprocessing provides the best performance among the approaches.
Unlike threads or async I/O, multiple processes can execute truly in parallel on separate CPU cores, because each process runs in its own Python interpreter with its own Global Interpreter Lock (GIL) and memory space.
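One practical tuning note, an assumption worth benchmarking rather than something measured above: Executor.map accepts a chunksize parameter (used by ProcessPoolExecutor) that batches tasks per worker, reducing inter-process communication overhead when there are many small tasks. A self-contained sketch with a toy function:

```python
from concurrent.futures import ProcessPoolExecutor


def square(n):
    return n * n


if __name__ == "__main__":
    numbers = list(range(10_000))
    with ProcessPoolExecutor(max_workers=3) as executor:
        # chunksize=500 ships tasks to workers in batches of 500 instead
        # of one at a time, cutting the per-task pickling round trips.
        results = list(executor.map(square, numbers, chunksize=500))
    print(results[:5])
```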
When to use concurrency and parallelism?
Use concurrency (ThreadPoolExecutor or asyncio) when performing I/O-bound tasks such as network requests, web API calls, database queries, and reading/writing files, where the program spends most of its time waiting on external resources.
Use parallelism (ProcessPoolExecutor) when performing CPU-bound tasks such as heavy data transformation, numerical computation, image processing, encryption, etc.
In more complex workflows, you can also combine concurrency and parallelism. For example, using concurrency to download data from multiple sources and parallelism to process those datasets efficiently across CPU cores.
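Here is a hedged sketch of that combination (the URLs and the CPU-bound count_bytes function are placeholders for real work): asyncio downloads pages concurrently while a process pool handles the CPU-heavy step in parallel, bridged with loop.run_in_executor:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp


def count_bytes(content):
    # Placeholder for a CPU-heavy transformation of the downloaded data.
    return len(content)


async def fetch_and_process(session, executor, site):
    async with session.get(site) as response:
        content = await response.read()  # concurrent I/O in the event loop
    loop = asyncio.get_running_loop()
    # Hand the CPU-bound step to a worker process, in parallel with
    # other downloads and other workers.
    size = await loop.run_in_executor(executor, count_bytes, content)
    print(f"{site}: processed {size} bytes")


async def main():
    sites = ["https://www.python.org", "https://github.com"]
    with ProcessPoolExecutor() as executor:
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(
                *(fetch_and_process(session, executor, s) for s in sites)
            )


if __name__ == "__main__":
    asyncio.run(main())
```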
Summary
For I/O-bound tasks
| Approach | Remarks |
|---|---|
| Synchronous | Slow; tasks block while waiting for I/O, so only one runs at a time. |
| Multithreading | Very fast; threads can make progress while others wait on I/O, ideal for concurrent I/O workloads. |
| Asyncio | Very fast, the most efficient; single-threaded, but an event loop and cooperative context switching handle many I/O operations concurrently with minimal overhead. |
| Multiprocessing | Fast but not resource-efficient; unnecessary for I/O-bound work, since processes add heavy overhead and duplicate memory. |
For CPU-bound tasks
| Approach | Remarks |
|---|---|
| Synchronous | Slow; only one core is used. |
| Multithreading | Very slow; threads compete for the GIL, preventing true parallel execution and adding context-switch overhead. |
| Asyncio | Slowest; runs sequentially inside one thread, and the async overhead adds even more latency. |
| Multiprocessing | Very fast, the most efficient; each process runs on its own core, bypassing the GIL and achieving true parallelism! |
Hope this helps. Cheers!