
Martin Chapman

C++26 and The Sender/Receiver Pattern: A Game-Changer for High-Performance Computing

Hello Everyone,

It's about dang time.

The world of high-performance computing (HPC) has long existed on a razor's edge between raw power and cumbersome platform-specific APIs. For decades, C++ developers seeking to build highly scalable, asynchronous systems have been forced to abandon the C++ Standard Library in favor of operating-system-specific primitives like Windows' I/O Completion Ports (IOCP) or Linux's io_uring. These low-level APIs were the only way to achieve the best performance because the standard C++ concurrency features were simply not robust enough for the task.

But all of that is about to change.

With the upcoming C++26 standard, a new asynchronous execution framework, known as the Sender/Receiver pattern, is poised to be a game-changer. This feature finally provides C++ developers with a powerful, portable, and composable model that is designed to rival the performance of its platform-specific predecessors and positions ISO C++ as a serious contender for HPC.

The Challenge of Legacy C++ Concurrency

Prior to this new standard, asynchronous programming in C++ was often a struggle. While features like std::async and std::thread provided basic building blocks, they lacked a coherent, standardized model for managing complex asynchronous workflows. They were not designed to gracefully manage millions of concurrent tasks or to handle critical concerns like backpressure and graceful shutdown. The result was that for any serious, high-scale application, developers had to resort to hand-coding their own solutions using low-level OS APIs, sacrificing portability in the process.
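
To make the std::async limitation concrete, here is a minimal sketch in standard C++ (no stdexec required). It relies on one well-known rule: a std::future obtained from std::async with std::launch::async blocks in its destructor until the task has finished. The function name fire_and_forget_order is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Hypothetical demo: records the order of two events, one from a
// "background" task and one from the caller.
std::vector<int> fire_and_forget_order() {
    std::vector<int> order;
    {
        // Launch a task and let its future go out of scope immediately,
        // as a naive "fire and forget" attempt would.
        auto f = std::async(std::launch::async, [&order] {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            order.push_back(1);  // event from the background task
        });
    }  // f is destroyed here; its destructor blocks until the task is done

    order.push_back(2);  // event from the caller, recorded only afterwards

    assert(order.size() == 2);  // sanity check: both events were recorded
    return order;
}
```

The task's event is always recorded before the caller's, so a loop of such calls runs one task at a time unless every future is stored and managed by hand. That bookkeeping is the kind of thing a coherent asynchronous model is meant to take over.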

A New Architecture for Best-in-Class Performance

The Sender/Receiver pattern solves these problems by providing a new, high-level abstraction that is built to leverage the most powerful underlying concurrency models. As demonstrated in a recent coding exercise, this pattern allows for the construction of a highly advanced and resilient architecture with several key advantages:

  • Scalable Flow Control: A producer-consumer model is at the heart of this architecture. The producer thread submits an immense number of tasks, but instead of building one giant, memory-hungry object up front, it creates each task on the fly and places it onto a scheduler's work queue. In the intended design this queue is bounded, so it applies "backpressure" that blocks the producer if it gets too far ahead. This keeps the process from exhausting memory, a critical safety feature for high-task-count systems.
  • Scalable Asynchronous I/O Beyond Completion Routines: The Sender/Receiver pattern is designed to overcome the limitations of older asynchronous I/O models like Alertable I/O on Windows. In that model, the completion routine is queued to the thread that initiated the I/O call and runs only when that thread enters an alertable wait, creating a potential bottleneck. In contrast, the Sender/Receiver pattern allows for a many-to-many model where a large number of tasks can be distributed across a pool of worker threads. This is a massive advantage for scalability and throughput. With this pattern, idle threads can wait for new work inside the scheduler, so application code does not need to hand-roll mutexes, semaphores, or other constructs that can introduce contention and create choke points. This freedom from hand-written synchronization simplifies the code and allows the system to scale efficiently, whether it's handling a high volume of file I/O or network requests.
  • Graceful and Controlled Shutdown: With the help of std::stop_source and std::stop_token, the main thread can gracefully handle a shutdown request (like Ctrl-C or a key press). Instead of brute-force terminating worker threads, it simply signals a "stop" intention. The producer thread stops submitting new tasks at its next check, and in-progress tasks exit cleanly at their next cancellation check instead of being killed mid-operation. This ensures data integrity and a clean exit, a stark contrast to the abrupt shutdowns of less sophisticated models.
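
The backpressure idea from the first bullet can be sketched in plain C++ without any scheduler at all. The following is a minimal illustration, not stdexec's implementation and not anything the standard provides: BoundedQueue, DemoResult, and run_bounded_demo are hypothetical names, and the sketch deliberately uses a mutex and condition variables to keep the blocking behavior easy to see.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: a FIFO queue that never holds more than `capacity` items.
template <class T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks while the queue is full: this is the backpressure.
    // Returns false if the queue was closed before the item could be added.
    bool push(T item) {
        std::unique_lock lock(mutex_);
        not_full_.wait(lock, [this] { return closed_ || items_.size() < capacity_; });
        if (closed_) return false;
        items_.push_back(std::move(item));
        high_water_ = std::max(high_water_, items_.size());
        not_empty_.notify_one();
        return true;
    }

    // Blocks while the queue is empty. Returns std::nullopt once the
    // queue has been closed and fully drained.
    std::optional<T> pop() {
        std::unique_lock lock(mutex_);
        not_empty_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop_front();
        not_full_.notify_one();
        return item;
    }

    // Wakes every waiter; consumers drain what is left and then stop.
    void close() {
        { std::lock_guard lock(mutex_); closed_ = true; }
        not_full_.notify_all();
        not_empty_.notify_all();
    }

    std::size_t high_water_mark() const {
        std::lock_guard lock(mutex_);
        return high_water_;
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    std::deque<T> items_;
    std::size_t capacity_;
    std::size_t high_water_ = 0;
    bool closed_ = false;
};

struct DemoResult { long long sum; std::size_t high_water; };

// One producer pushes 1..num_items; num_workers consumers add them up.
DemoResult run_bounded_demo(int num_items, std::size_t capacity, unsigned num_workers) {
    BoundedQueue<int> queue(capacity);
    std::atomic<long long> sum{0};
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w)
        workers.emplace_back([&queue, &sum] {
            while (auto item = queue.pop()) sum += *item;
        });
    for (int i = 1; i <= num_items; ++i) queue.push(i);  // blocks when full
    queue.close();
    for (auto& t : workers) t.join();
    return {sum.load(), queue.high_water_mark()};
}
```

However many items the producer wants to submit, the queue's high-water mark never exceeds its capacity, which is the property that keeps memory use flat. A scheduler-based design would aim for the same effect at the point where work is submitted.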

A Practical Look with NVIDIA stdexec

While the C++26 standard is still a few months away, the community has been hard at work creating reference implementations to test and prove the power of this new model. We owe a huge debt of gratitude to companies like NVIDIA for taking the initiative to develop the stdexec library as a high-performance, open-source proof-of-concept. The fact that a leader in high-performance computing is investing heavily in this new standard is a powerful signal to all serious C++ developers that this is a feature you should be paying close attention to.

Below is a demonstration of the pattern I've been discussing, implemented using the stdexec library. This code showcases how to create a console application that scales to billions of tasks with graceful shutdown and a producer-consumer model that avoids memory overruns. While the code compiled and ran successfully in Visual Studio 2022 on Windows 11, the stdexec code on GitHub should be considered a proof of concept and not an official implementation, so some behavior may not work completely as expected.

One specific behavior noted in this prototype was that the scheduler's work queue did not appear to be bounded and continued to grow as tasks were submitted. This might be an issue with my implementation or a feature not yet implemented. I had expected a bounded FIFO queue, which is critical for backpressure. As far as I can tell, the std::execution proposal does not itself require a scheduler to bound its queue, so the bound may need to be enforced by the application.

On a related note, I hope that members of the ISO C++ committee will consider improving the std::thread::hardware_concurrency() function. As the example code shows, a platform-specific workaround like GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) is required on some multi-NUMA systems to correctly return all logical cores. A flag or an overload that provides this functionality natively would be a valuable addition to the standard library.

I apologize for the code formatting; the LinkedIn code block forces lines to wrap instead of providing a horizontal scrollbar.

C++
// Demonstrates a producer-consumer pattern with graceful shutdown
// using C++26 std::execution framework (via NVIDIA's stdexec library).
//
// The core concepts are mapped as follows:
// - IOCP Thread Pool -> exec::static_thread_pool
// - Completion Port for Shutdown -> std::stop_source and std::stop_token
// - Producer Thread -> launches task on a worker thread.
// - Worker Threads -> The threads managed by the static_thread_pool
//
// The worker tasks are cooperatively cancellable, meaning they
// will check stop signal and exit gracefully if a shutdown is requested.
//
// To compile and run this code, you will need a C++20-compliant
// compiler and the stdexec library. Stdexec is header-only, so
// you can simply add the repository's 'include' directory to your
// project's include path.
//
// You can get the library here: https://github.com/NVIDIA/stdexec

#include <iostream>
#include <thread>
#include <chrono>
#include <stop_token> // For std::stop_source and std::stop_token
#include <stdexec/execution.hpp>
#include <exec/static_thread_pool.hpp>
#include <csignal> // For signal handling

// Platform-specific headers for getting core count across all NUMA nodes
#ifdef _WIN32
#include <windows.h>
#elif defined(__linux__)
#include <unistd.h>
#endif

// Short alias for the stdexec namespace
namespace ex = stdexec;

// Global stop_source, so the signal handler and main() can both access it.
// Cross-platform equivalent of IOCP shutdown messaging.
std::stop_source global_stop_source;

// A simple task that performs work. It is designed to be cancelled.
void worker_task(std::stop_token stoken, unsigned long long task_id) {
    // Check for a stop request before starting
    if (stoken.stop_requested()) {
        // Task cancelled before starting.
        return;
    }

    // Task started.

    // Simulate work in a loop, checking for cancel in each iteration.
    for (int i = 0; i < 5; ++i) {
        if (stoken.stop_requested()) {
            // Task received shutdown signal. Exiting gracefully.
            return; // Graceful exit
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }

    // Task completed successfully.
}

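// A C-style signal handler function. It's called when a signal like
// CTRL-C is received.
// Caution: std::stop_source::request_stop() is not guaranteed to be
// async-signal-safe, which is one reason main() below does not install
// this handler and uses console input instead.
void signal_handler(int signal) {
    if (signal == SIGINT) {
        // SIGINT received. Requesting graceful shutdown...
        global_stop_source.request_stop();
    }
}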

// Function to get total number of logical cores across all NUMA nodes.
// This provides a more accurate count than std::thread::hardware_concurrency()
// on some multi-socket or multi-NUMA systems.
unsigned int get_all_logical_cores() {
#ifdef _WIN32
    // Windows implementation using GetActiveProcessorCount with the
    // ALL_PROCESSOR_GROUPS parameter. This counts logical cores across all
    // processor groups, giving a more accurate result on high-core-count machines.
    return GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
#elif defined(__linux__)
    // Linux implementation. sysconf returns a long (-1 on error).
    const long count = sysconf(_SC_NPROCESSORS_ONLN);
    return count > 0 ? static_cast<unsigned int>(count) : 0u;
#else
    // Fallback for other platforms
    return std::thread::hardware_concurrency();
#endif
}

// This function acts as the producer. Runs on a worker thread and submits
// a large number of tasks to the thread pool.
void producer_task(ex::scheduler auto scheduler, std::stop_token stoken, const unsigned long long num_tasks) {
    // Producer task started on a worker thread.

    for (unsigned long long i = 0; i < num_tasks; ++i) {
        if (stoken.stop_requested()) {
            // Producer task received shutdown signal. Stops task.
            break; // Stop producing tasks
        }

        // Create a sender for each task and submit it to the scheduler.
        // Dynamic, memory-efficient way to handle large number of tasks.
        auto task_sender = ex::schedule(scheduler)
            | ex::then([stoken, i] {
                  worker_task(stoken, i);
              });

        // Submits task to the scheduler without blocking producer thread.
        // start_detached is a "sender consumer" that kicks off work and
        // detaches its lifetime, so we don't have to wait for it here.
        ex::start_detached(std::move(task_sender));
    }
    // Producer task finished submitting tasks.
}

int main() {
    // Note: Signal handling can be unreliable in some IDEs.
    // We will use a more robust input-based shutdown mechanism below.
    // std::signal(SIGINT, signal_handler);

    // 1. Create a thread pool with a fixed number of workers.
    // Fall back to one worker if the core count cannot be determined.
    const unsigned int detected_cores = get_all_logical_cores();
    const unsigned int num_workers = detected_cores > 0 ? detected_cores : 1;
    exec::static_thread_pool pool(num_workers);

    // Get a scheduler from the thread pool.
    auto scheduler = pool.get_scheduler();

    // 2. The main thread launches the producer task.
    std::cout << "Main thread launching producer task." << std::endl;

    const unsigned long long num_tasks = 200000000000ULL;

    // The stop_token is retrieved from the global stop_source.
    auto stop_token = global_stop_source.get_token();

    // The producer_task is wrapped in a sender.
    auto producer_sender = ex::schedule(scheduler)
        | ex::then([=, stoken = stop_token] {
              producer_task(scheduler, stoken, num_tasks);
          });

    std::cout << "Producer task has been launched. The program will now run until a shutdown signal is received." << std::endl;
    std::cout << "Press ENTER to request a graceful shutdown." << std::endl;

    // Launch the producer task in a background thread.
    // Main thread will then wait for user input to trigger a shutdown.
    std::jthread producer_main_thread([&]() {
        ex::sync_wait(std::move(producer_sender));
    });

    // Wait for the user to press ENTER.
    std::cin.get();

    std::cout << "\nRequesting graceful shutdown..." << std::endl;
    global_stop_source.request_stop();

    // Wait for the producer to notice the stop request and return,
    // so the message below is printed only after it has stopped.
    producer_main_thread.join();

    // The thread pool joins its worker threads when it is destroyed
    // at the end of main.
    std::cout << "\nProducer has stopped. Worker threads are joined as the pool is destroyed. Exiting main." << std::endl;

    return 0;
}

Summary

The introduction of the Sender/Receiver pattern in C++26 marks a pivotal moment for the language. This highly advanced asynchronous construct finally provides a powerful and portable foundation for implementing sophisticated design patterns that were previously only possible using platform-specific APIs like IOCP or io_uring. The example code above, which demonstrates a scalable producer-consumer model with graceful shutdown, merely scratches the surface of what's possible. The Sender/Receiver pattern can also be used for highly scalable device I/O for files, sockets, pipes, and serial communications. The ISO C++ community has finally closed a critical language gap. It is a monumental achievement that ushers in a new age of cross-platform HPC, and it comes not a moment too soon for the AI revolution.