
Martin Chapman

High Performance Async I/O: IOCP vs. I/O Rings - Part III: The Synergy

Hello everyone,

This is the final article in my three-part series on high-performance asynchronous I/O. In Part I, I introduced Microsoft's new I/O Rings API, exploring its core concepts like the shared-memory ring buffer architecture and its use for both single- and multi-threaded I/O operations. I also provided a simple C++ class to demonstrate its elegance and raw speed. In Part II, I moved on to a head-to-head comparison, pitting the new I/O Rings against the venerable I/O Completion Ports (IOCP). I discovered that rather than being rivals, they form a powerful synergy, with I/O Rings excelling at raw I/O throughput and IOCP at scalable, parallel processing. I illustrated this hybrid model with a tiled image viewer and a high-performance video renderer, demonstrating how to leverage the best of both APIs. In this final installment, I'll reveal how I/O Rings is not just a standalone API but a key component in Microsoft's grander strategy to revolutionize the Windows storage stack. My goal is to show how these APIs work in synergy to unlock a new era of high-performance computing, a strategy that started with the release of Windows 11.

Let's dive in.

The New I/O Strategy and the Gaming Bottleneck

For decades, the Windows I/O stack was designed for Hard Disk Drives (HDDs). These "spinning rust" disks operate mechanically, with a physical head moving across a platter to read data. This mechanical nature means that random I/O operations are slow due to seek time. To hide this latency, operating systems and APIs evolved to handle I/O asynchronously, allowing applications to continue working while waiting for the disk to respond. The entire software architecture was built around this single bottleneck: the physical limitations of the hard drive.

The advent of NVMe (Non-Volatile Memory Express) Solid-State Drives (SSDs) changed everything. Unlike HDDs, NVMe SSDs have no moving parts and are connected directly to the CPU via the high-speed PCIe bus. This moves the bottleneck from the disk itself to the software that manages it. For example, a single read request on a traditional HDD might take 5-10 milliseconds, but on an NVMe SSD the same operation can take as little as 10-20 microseconds. With the device hundreds of times faster, many storage-heavy workloads went from being I/O-bound to being CPU-bound, with a significant amount of CPU time spent on I/O overhead. The legacy I/O stack, designed around hiding the latency of slow disks, became a significant source of that overhead, which disproportionately affects performance-hungry applications like video games and content creation software.

Modern games are massive, consisting of tens of thousands of small, compressed files for textures, models, and audio. Loading these assets from disk and getting them ready for rendering on the GPU is a major performance challenge. On a traditional system, the CPU is burdened with handling thousands of individual I/O requests, each requiring a costly transition from user mode into the kernel. Furthermore, after the data is read, the CPU must also decompress it and prepare it for the GPU. This process is highly inefficient, leading to long loading screens and in-game "stutters" as the system struggles to keep up.

This bottleneck isn't limited to games; it also severely impacts professional applications such as financial modeling and data analytics, Geographic Information Systems (GIS), video editing software, and scientific computing platforms. Financial modeling and data analytics software needs to rapidly load and process terabytes of data from storage, but the I/O bottleneck prevents it from feeding that data to the CPU and GPU at the speed required for real-time analysis. Video editors need to stream multiple high-resolution video and audio tracks simultaneously without stuttering, and often have to fall back on low-resolution proxy files to work efficiently. A GIS application might need to rapidly load thousands of small image tiles or terrain data files from disk to build a high-resolution, interactive map, including complex 3D terrain meshes and massive LiDAR point clouds that require streaming billions of points from storage. The current I/O stack and APIs severely limit the performance of these applications.

Microsoft's ultimate goal is to close the gap between what NVMe SSDs can provide and what the operating system can deliver. While a high-end NVMe drive can achieve raw throughput exceeding 14 GB/s and millions of I/O operations per second (IOPS), applications using the legacy Win32 file APIs often top out at around 1-2 GB/s because of per-request CPU overhead. The new I/O strategy aims to eliminate this overhead, allowing applications to use the full speed of the hardware. This is the impetus for new APIs like I/O Rings and BypassIO.
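To make the scale of the problem concrete, here is a small back-of-the-envelope helper. It is only a sketch: the names `IoBudget` and `ioBudget` are mine, and the request sizes are illustrative assumptions. It takes a target throughput such as the 14 GB/s figure above and a request size, and works out how many requests per second that implies and how much wall-clock time the whole machine has for each one.

```cpp
#include <cassert>
#include <cstdint>

// Result of the estimate (hypothetical helper type, not part of any Windows API).
struct IoBudget {
    double requestsPerSecond;      // reads per second needed to hit the target
    double nanosecondsPerRequest;  // wall-clock time available per read, machine-wide
};

// bytesPerSecond: target throughput, e.g. 14e9 for 14 GB/s.
// requestBytes:   size of each individual read request.
inline IoBudget ioBudget(double bytesPerSecond, std::uint64_t requestBytes) {
    assert(bytesPerSecond > 0.0 && requestBytes > 0);
    const double rps = bytesPerSecond / static_cast<double>(requestBytes);
    return IoBudget{rps, 1e9 / rps};
}
```

With 4 KiB reads, `ioBudget(14e9, 4096)` works out to roughly 3.4 million requests per second, or about 293 nanoseconds per request across all cores combined. With 64 KiB reads, `ioBudget(14e9, 65536)` gives roughly 214,000 requests per second and about 4.7 microseconds each. At the small-request end, any fixed per-request software cost quickly eats the entire budget, which is why reducing that cost matters so much.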

A Closer Look at the I/O Model

The fundamental difference between the old and new Windows I/O paradigms can be described as a shift from a push model to a pull model. In a traditional IOCP-based system, the application pushes a request to the kernel and then waits for the kernel to push a completion packet back onto the completion port's queue. The kernel decides which waiting thread is released for each completion and how many of them run concurrently.

The new I/O Rings model, by contrast, is a pull-based system. The application pushes a batch of requests onto the submission queue and then actively pulls completion events from the shared completion queue. This model gives the application much more control over its I/O loop, reducing the need for costly context switches and allowing it to use a busy-wait or polling loop for extremely low-latency scenarios, mirroring the highly optimized approach of Linux's io_uring. This philosophical shift is central to how I/O Rings achieves its raw throughput advantage by empowering the application to manage the I/O flow.
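The submit-a-batch, pull-the-completions flow can be illustrated with a toy, single-threaded model. This is not the Windows API: `Ring`, `ToyIoRing`, `build`, `submit`, and `popCompletion` are hypothetical names that only loosely mirror the roles of `BuildIoRingReadFile`, `SubmitIoRing`, and `PopIoRingCompletion`, and the "kernel" here is just a member function that drains one queue into the other. The point is to show how many requests can share a single simulated kernel transition.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-capacity FIFO ring indexed by free-running head/tail counters.
// N must be a power of two so that masking wraps the index.
template <typename T, std::size_t N>
class Ring {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& v) {
        if (tail_ - head_ == N) return false;      // ring is full
        slots_[tail_++ & (N - 1)] = v;
        return true;
    }
    bool pop(T& out) {
        if (head_ == tail_) return false;          // ring is empty
        out = slots_[head_++ & (N - 1)];
        return true;
    }
    std::size_t size() const { return tail_ - head_; }
private:
    std::array<T, N> slots_{};
    std::size_t head_ = 0, tail_ = 0;
};

struct Request    { std::uint64_t userData; std::uint32_t bytes; };
struct Completion { std::uint64_t userData; std::uint32_t bytesTransferred; };

// Toy stand-in for an I/O ring (hypothetical; no real I/O is performed).
class ToyIoRing {
public:
    // Application side: queue a request. No kernel work happens yet.
    bool build(const Request& r) { return sq_.push(r); }

    // One simulated kernel transition: drain the submission queue and post a
    // completion per entry, stopping early if the completion queue is full.
    std::size_t submit() {
        ++transitions_;
        std::size_t handled = 0;
        Request r{};
        while (cq_.size() < CqSize && sq_.pop(r)) {
            cq_.push(Completion{r.userData, r.bytes});
            ++handled;
        }
        return handled;
    }

    // Application side: pull one completion without blocking.
    bool popCompletion(Completion& out) { return cq_.pop(out); }

    std::size_t transitions() const { return transitions_; }
private:
    static constexpr std::size_t SqSize = 8, CqSize = 16;
    Ring<Request, SqSize> sq_;
    Ring<Completion, CqSize> cq_;
    std::size_t transitions_ = 0;
};
```

Queuing five requests with `build`, calling `submit` once, and then looping on `popCompletion` until it returns false yields five completions in submission order after a single simulated transition. In this toy, `userData` is simply echoed back so the application can match each completion to its request.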

The BypassIO I/O Path

Before I dive into DirectStorage, it's crucial to understand BypassIO, an optimized I/O path for non-cached reads in Windows 11. Traditionally, a file read request navigates a labyrinthine I/O stack, where essential third-party software—such as antivirus scanners, backup utilities, and virtualization tools—intercepts and processes the data through filter drivers. While critical for security and data integrity, these interventions add layers of latency and overhead. BypassIO, however, is engineered to sidestep this complexity, carving out a direct, high-speed path from the application to the storage device. If BypassIO is the high-speed highway, then I/O Rings is the powerful race car that can drive on it.

For a file to use the BypassIO path, the I/O request must come from a supported application and travel through a file system and storage driver that have both opted-in to the new model. Microsoft is actively collaborating with third-party vendors to upgrade their drivers and ensure their software is compatible with, and does not interfere with, this new paradigm. This optimized path provides the necessary infrastructure to handle a massive volume of I/O requests with minimal latency. While this path is currently limited to NVMe devices and the NTFS file system, the vision is for it to become the standard for all high-performance storage. The synergy between a low-level API like I/O Rings and the BypassIO path is what allows developers to achieve new levels of performance.
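The opt-in rule described above can be written down as a small decision function. This is purely a conceptual model under my own assumptions: `StackLayer`, `ReadRequest`, and `bypassBlocker` are hypothetical names, and the real negotiation happens inside the Windows I/O stack rather than in application code like this. It only encodes the conditions stated in this article: a non-cached read, on NTFS, on an NVMe device, with every layer in the stack having opted in.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One layer in a simplified I/O stack: a filter driver, the file system,
// or the storage driver (hypothetical model type).
struct StackLayer {
    std::string name;
    bool supportsBypass;  // has this layer opted in to the fast path?
};

// The properties of a request that matter for this model.
struct ReadRequest {
    bool isRead;
    bool nonCached;
    bool onNtfs;
    bool onNvme;
};

// Returns an empty string when the fast path is available in this model,
// otherwise a short reason why the request falls back to the traditional path.
inline std::string bypassBlocker(const ReadRequest& req,
                                 const std::vector<StackLayer>& stack) {
    if (!req.isRead || !req.nonCached) return "request is not a non-cached read";
    if (!req.onNtfs) return "volume is not NTFS";
    if (!req.onNvme) return "device is not NVMe";
    for (const StackLayer& layer : stack) {
        // A single layer that has not opted in is enough to force the fallback.
        if (!layer.supportsBypass) return "layer has not opted in: " + layer.name;
    }
    return std::string();
}
```

The design point the model captures is that the fast path is all-or-nothing: one filter driver that has not been updated is enough to send every request on that stack back down the traditional route, which is why vendor adoption matters so much.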

DirectStorage and I/O Rings

The DirectStorage API is a high-level API designed for game developers to rapidly load and stream game assets. While it serves a different purpose than I/O Rings, the two are deeply intertwined. DirectStorage leverages the low-level optimizations provided by I/O Rings. It uses the same queueing and batching model to handle hundreds or thousands of I/O requests at once, and it relies on the BypassIO path to ensure those requests hit the NVMe SSD with minimal latency.

A key feature of DirectStorage is its ability to offload data decompression to the GPU. By using the GPU's parallel processing power to decompress assets, it frees up the CPU to handle other game logic. This is a massive win for modern games with their high-resolution textures and complex 3D models.

In short, I/O Rings and BypassIO provide the low-level "plumbing" for a high-performance I/O channel, while DirectStorage is the high-level "application" that uses this channel to provide a specific, highly optimized solution for gaming. This synergy allows developers to achieve new levels of performance that were simply not possible with traditional APIs like IOCP alone.

The Long-Term Vision

While DirectStorage is the most prominent application of this new I/O strategy, it's just the beginning. The long-term vision is for BypassIO and I/O Rings to become the standard, foundational I/O path for all I/O-intensive applications on Windows. For developers of high-performance software, whether they are building enterprise databases, scientific computing platforms, GIS applications, or video editing suites, the new model offers a clear, modern path to optimal performance. Public discussions about a future standard C++ asynchronous I/O library (referred to here as std::io, although no such library is part of a published C++ standard yet) point in the same direction. Such a high-level, portable API could transparently use underlying kernel features like I/O Rings and other modern APIs to provide superior asynchronous I/O performance.

The old I/O stack, with its layers of filters and system call overhead, will become a legacy path. As more third-party software and drivers are updated to support BypassIO, the performance benefits will extend to an ever-widening range of applications. The goal is to make the entire Windows I/O subsystem more streamlined and efficient, enabling developers to build a new generation of software that can stream massive amounts of data in real time without compromising system performance. This will ultimately unlock the full potential of today's powerful hardware for everyone.

While the world's attention often gravitates towards the dazzling advancements in AI and Machine Learning, a silent revolution is unfolding in the realm of High-Performance Computing and asynchronous I/O. For those with a keen eye on these seismic shifts, the stage is set to craft the next generation of groundbreaking software, harnessing the raw, unbridled power of I/O Rings, BypassIO, and other exhilarating advancements in high-performance computing. The future is not just intelligent; it's blazingly fast.

Looking Ahead

This series has explored the evolution of high-performance I/O on Windows, from the venerable IOCP to the new I/O Rings API. We've seen how these tools, combined with modern technologies like BypassIO and DirectStorage, are part of a unified strategy to unlock the full potential of today's hardware.

In my coming articles, I will put this knowledge into practice. I will show you how to tie these concepts together to build truly cutting-edge, high-performance applications. By streamlining the process of loading complex data—from 3D geometry to ultra-high-resolution imagery—directly from disk to GPU, I will push the physical limits of a single machine to the near breaking point. I will then illustrate how to scale these triumphs, building massively parallel data processing pipelines that utilize multiple computers, multiple processes, multiple CPU cores, and multiple GPUs in a relentless quest for Beast Mode HPC.

Stay tuned.

What are your thoughts? Do you have any comments or corrections?