Martin Chapman

High Performance Async I/O: IOCP vs. I/O Rings - Part I: The New Kid on the Block

IoRingAPI isn't a direct replacement for IOCP, but a specialized tool that provides a new level of performance for device I/O.

Hello everyone,

In my previous articles, I've often touched on the importance of high-performance asynchronous I/O, for example when discussing the "physics of software development". It's a topic that deserves a much deeper exploration, which is why I'm dedicating this three-part series to it. I hope at least a few of you will enjoy these geeky articles. LOL!

For decades, Windows developers building multi-threaded applications have relied heavily on the raw power of I/O Completion Ports (IOCP). It's a critical tool in the arsenal of any programmer serious about creating scalable, high-throughput I/O systems. Now, with Windows 11, Microsoft has introduced a powerful new API called IoRingAPI that was inspired by the io_uring framework on Linux. It isn't a direct replacement for IOCP, but a specialized tool that offers an alternative to traditional completion routines and IOCP for specific high-performance device I/O tasks.

This article is the first in a three-part series where I will explore this new API. The series is organized as follows:

  • Part I: The New Kid on the Block. I will introduce I/O Rings, explore their core concepts, and show how they can be used for single- and multi-threaded I/O operations.
  • Part II: A Tale of Two Titans. I will dive deep into the architecture of both IOCP and the new I/O Rings, comparing their strengths, weaknesses, and ideal use cases. I will also demonstrate how to combine them to build a truly masterful piece of high-performance I/O code, leveraging the best of both APIs.
  • Part III: The Synergy. I will discuss how I/O Rings are part of Microsoft's broader strategy for high-performance I/O, including its relationship with BypassIO and the DirectStorage API.

Let's begin by taking a deep dive into I/O Rings and why they're generating so much excitement.

Introducing I/O Rings: The Windows Take on io_uring

The I/O Rings API is a powerful new asynchronous I/O framework available on Windows 11. It was heavily inspired by the highly successful io_uring API on Linux and brings a similar, queue-based architecture to the Windows kernel.

At its core, I/O Rings is built on a shared-memory ring buffer architecture. This model consists of two primary queues:

  • Submission Queue (SQ): The application uses this queue to submit I/O requests to the kernel.
  • Completion Queue (CQ): The kernel uses this queue to post completion results for I/O operations.

The key to I/O Rings' performance is that the application and the kernel share these ring buffers. An application can place a batch of requests in the submission queue, and the kernel can pull them off without a costly user-to-kernel transition for each individual request. This dramatically reduces overhead and allows for very high I/O throughput.
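
To make the queue mechanics concrete, here is a minimal, portable sketch of a ring in plain C++. It is a conceptual model only: ToyRing and ToyRequest are names I made up for illustration, and the real Windows submission and completion queue layouts are different. What it shows is the general idea that a producer can queue many entries by writing slots and bumping a tail index, and that a consumer can then drain the whole batch in one pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical request record; the real submission entry layout differs.
struct ToyRequest
{
    std::uint64_t offset;
    std::uint32_t length;
};

// Toy single-producer, single-consumer ring. Head and tail only ever grow;
// masking them with (capacity - 1) turns them into slot indices.
class ToyRing
{
public:
    explicit ToyRing(std::size_t capacity)
        : mSlots(capacity), mMask(capacity - 1)
    {
        // the mask trick only works for a non-zero power-of-two capacity
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    }

    // producer side: write a slot and advance the tail, nothing else
    bool Push(const ToyRequest& request)
    {
        if (mTail - mHead == mSlots.size())
            return false; // ring is full
        mSlots[mTail & mMask] = request;
        ++mTail;
        return true;
    }

    // consumer side: drain everything queued so far in a single pass
    std::size_t Drain(std::vector<ToyRequest>& out)
    {
        std::size_t drained = 0;
        while (mHead != mTail)
        {
            out.push_back(mSlots[mHead & mMask]);
            ++mHead;
            ++drained;
        }
        return drained;
    }

private:
    std::vector<ToyRequest> mSlots;
    std::size_t mMask;
    std::size_t mHead{0};
    std::size_t mTail{0};
};
```

In this model the cost of "submitting" is a couple of memory writes per request, and the single Drain() call stands in for the one transition that picks up the whole batch.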

Furthermore, IoRingAPI provides specific mechanisms to handle multiple files efficiently. Instead of passing file handles and buffer information with every single I/O request, the API allows for pre-registration. You can register an array of file handles and an array of data buffers with the kernel. Subsequent I/O requests can then simply reference these pre-registered resources by their index. This further reduces the system call overhead by eliminating the need for repeated validation and data copying. For a use case involving multiple small I/O operations across different files, such as a database system, this batching and pre-registration capability allows the kernel to intelligently schedule and execute the requests, leading to a significant performance boost.
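
The "validate once, reference by index" idea can be sketched without any Windows types. The snippet below is a toy model under my own assumptions: ToyFile, ToyRegistry and ToyReadRequest are hypothetical names, and the real API registers HANDLE values and buffers with the kernel rather than with a user-mode table. It only illustrates why a request that carries a small index is cheaper than one that carries a resource needing a fresh check each time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an opened file; not a real Windows HANDLE.
struct ToyFile
{
    int id;
};

// Toy registration table: the per-resource work happens once, in Register().
class ToyRegistry
{
public:
    // returns the index that later requests use to refer to this file
    std::uint32_t Register(const ToyFile& file)
    {
        ++mValidations; // stands in for the one-time access check
        mFiles.push_back(file);
        return static_cast<std::uint32_t>(mFiles.size() - 1);
    }

    // resolving a registered index is just a bounds-checked array lookup
    const ToyFile& At(std::uint32_t index) const
    {
        assert(index < mFiles.size());
        return mFiles[index];
    }

    std::size_t Validations() const { return mValidations; }

private:
    std::vector<ToyFile> mFiles;
    std::size_t mValidations{0};
};

// A request names its file by index instead of carrying the file itself.
struct ToyReadRequest
{
    std::uint32_t fileIndex;
    std::uint64_t offset;
    std::uint32_t length;
};
```

With two files registered, a thousand ToyReadRequest values can be resolved through At() while Validations() stays at two, which is the effect the paragraph above describes.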

In fact, Paul Moore, a principal software engineer at Microsoft, has publicly praised the speed of io_uring on Linux, noting its impressive performance gains. The fact that Microsoft has now implemented its own version, IoRingAPI, to remain competitive is a powerful form of flattery, as it demonstrates the effectiveness of the original design.

I/O Rings vs. io_uring: The Similarities and Differences

I/O Rings closely mirrors the core design principles of io_uring, which provides an excellent mental model for Linux developers migrating to Windows. Both APIs feature:

  • Shared Ring Buffers: Both use a shared-memory architecture to minimize the user-to-kernel transition overhead.
  • Batching: Both allow an application to submit a large number of I/O requests at once, which the kernel can process more efficiently.
  • Asynchronous-by-Design: Both APIs are inherently asynchronous, enabling applications to overlap computation and I/O.

However, a key difference is that I/O Rings is currently a much more focused API. While io_uring on Linux has grown to support a wide range of operations, including networking and even user-space polling, I/O Rings is currently targeted at a specific set of device I/O operations. It is not intended as a full replacement for the vast functionality of IOCP but rather as an additional, specialized tool for specific performance-critical I/O tasks.

Using I/O Rings for I/O

I/O Rings can be used in either a single-threaded, blocking fashion or a multi-threaded, non-blocking fashion.

Single-Threaded, Blocking Usage

In a single-threaded context, a developer can submit an I/O request and then block until it completes. This is achieved by calling SubmitIoRing() with the number of operations to wait for and a timeout, so that the call returns only once those operations have completed or the timeout expires. This model is useful for simple I/O tasks where the goal is to offload the I/O to the kernel without the complexity of a multi-threaded architecture.

Multi-Threaded, Non-Blocking Usage

The real power of I/O Rings becomes apparent in a multi-threaded, non-blocking scenario. In this model, application threads queue I/O requests with the BuildIoRing* functions and hand them to the kernel with SubmitIoRing(). They do not block, but rather continue with other work. When the kernel completes the I/O, it places a completion entry onto the shared completion queue. A dedicated pool of worker threads can then pull these completion entries off the queue using PopIoRingCompletion(), processing the results as they become available. This model mirrors the powerful, event-driven architecture of IOCP, allowing for massive I/O concurrency and scalability. In the next article of this series, I will dive further into this multi-threaded use case and provide sample code.

A Simple C++ Class for I/O Rings

To demonstrate the power and simplicity of this new API, here is a C++ class that wraps I/O Rings to perform a blocking read operation. This class uses I/O Rings in a single-threaded blocking mode to fill a user-provided buffer with fixed-length records from a file, such as rows in a database table or tiles in an image file. After the read operation completes, I iterate over the bytes in the user-supplied buffer and extract each record. The internal mechanism uses SubmitIoRing() to push the read requests to the kernel and then blocks until the data is available. Records in the file are aligned to a multiple of the disk's logical sector size, as required for unbuffered I/O on a file opened with FILE_FLAG_NO_BUFFERING.
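
The alignment rule is easy to get wrong, so here is a small portable sketch of the arithmetic. These helpers are not part of the class that follows, and AlignedRange and CoverRecord are names I introduced for illustration. The sketch assumes the sector size is a power of two and shows how an arbitrary record can be widened to the aligned range that contains it.

```cpp
#include <cassert>
#include <cstdint>

// true when value is a non-zero power of two
constexpr bool IsPowerOfTwo(std::uint64_t value)
{
    return value != 0 && (value & (value - 1)) == 0;
}

// round down to the nearest multiple of alignment (a power of two)
constexpr std::uint64_t AlignDown(std::uint64_t value, std::uint64_t alignment)
{
    return value & ~(alignment - 1);
}

// round up to the nearest multiple of alignment (a power of two)
constexpr std::uint64_t AlignUp(std::uint64_t value, std::uint64_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper type: an aligned offset and length.
struct AlignedRange
{
    std::uint64_t offset;
    std::uint64_t length;
};

// smallest sector-aligned range that fully covers the given record
inline AlignedRange CoverRecord(std::uint64_t recordOffset,
                                std::uint64_t recordLength,
                                std::uint64_t sectorSize)
{
    assert(IsPowerOfTwo(sectorSize));
    const std::uint64_t begin = AlignDown(recordOffset, sectorSize);
    const std::uint64_t end = AlignUp(recordOffset + recordLength, sectorSize);
    return { begin, end - begin };
}
```

For example, a 100-byte record at offset 1000 with 512-byte sectors is covered by the range starting at offset 512 with length 1024, while a record that is already aligned comes back unchanged.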

This example is intended to showcase the elegance and simplicity of the API, and to serve as a learning guide for this series of articles. A more advanced example would pre-register multiple file handles and multiple buffers to really demonstrate the full power of IoRingAPI. In that scenario, a user would register handles and buffers once at startup, and then perform many reads and/or writes across multiple files, avoiding even more access checks and context switches. A practical use case might be repeated access to one or more database files during the lifetime of an application to retrieve records that are distributed across many files. A second use case might be to retrieve tiles from an image where the tiles are distributed across one or more files, such as reduced-resolution datasets that exist in separate side-car files. I will provide a more full-featured class that demonstrates that type of usage in Part II: A Tale of Two Titans. Please feel free to copy this class and use it any way you like in your own code.

#include <windows.h>
#include <ioringapi.h>
#include <cstring> // memset, memcpy
#include <vector>

#pragma comment(lib, "onecoreuap.lib")

// simple class to read bytes using windows ioringapi
// this class allows a user to read random blocks
// (records) in a synchronous manner but behind the
// scenes it uses IoRing to optimize I/O operations by
// minimizing kernel-mode switches and associated overhead

class FileIoRing
{
public:

    // io block struct
    struct IoBlock
    {
        unsigned long long offset{0};
        unsigned int length{0};
    };

    // default constructor
    FileIoRing(void)
    {
        // initialize variables
        mFileHandle = INVALID_HANDLE_VALUE;
        mIoRingHandle = nullptr;
        mBlockAlignment = 512;
    }

    // virtual destructor
    virtual ~FileIoRing(void)
    {
        // close
        Close();
    }

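    // simple open function for existing file
    bool Open(const char* fileName)
    {
        // close existing
        Close();

        // validate file name
        if (fileName == nullptr)
            return false;

        // open file as overlapped for asynchronous usage
        // and disable operating system buffering; CreateFileA
        // is called explicitly because fileName is a narrow string
        mFileHandle = ::CreateFileA(fileName, GENERIC_READ, FILE_SHARE_READ, nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, nullptr);
        if (mFileHandle == INVALID_HANDLE_VALUE)
        {
            Close();
            return false;
        }

        // ioring flags
        IORING_CREATE_FLAGS flags = {};
        flags.Required = IORING_CREATE_REQUIRED_FLAGS_NONE;
        flags.Advisory = IORING_CREATE_ADVISORY_FLAGS_NONE;

        // create ioring handle and set the queues to their max
        HRESULT hr = ::CreateIoRing(IORING_VERSION_4, flags, 0x10000, 0x20000, &mIoRingHandle);
        if (hr != S_OK)
        {
            Close();
            return false;
        }

        // queue registration of the file handle
        HANDLE handles[1] = { mFileHandle };
        hr = ::BuildIoRingRegisterFileHandles(mIoRingHandle, 1, handles, 0);
        if (hr != S_OK)
        {
            Close();
            return false;
        }

        // the registration is only queued at this point, so submit
        // it and wait for it to complete while the handles array
        // is still in scope
        unsigned int submittedEntries = 0;
        hr = ::SubmitIoRing(mIoRingHandle, 1, INFINITE, &submittedEntries);
        if (hr != S_OK)
        {
            Close();
            return false;
        }

        // pop the registration completion and check its result
        IORING_CQE cqe = {};
        hr = ::PopIoRingCompletion(mIoRingHandle, &cqe);
        if (hr != S_OK || FAILED(cqe.ResultCode))
        {
            Close();
            return false;
        }

        return true;
    }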

    // close file
    void Close()
    {
        // close ioring handle
        if (mIoRingHandle != nullptr)
        {
            ::CloseIoRing(mIoRingHandle);
            mIoRingHandle = nullptr;
        }

        // close file handle
        if (mFileHandle != INVALID_HANDLE_VALUE)
        {
            ::CloseHandle(mFileHandle);
            mFileHandle = INVALID_HANDLE_VALUE;
        }
    }

    // read file in aligned asynchronous blocks internally
    // using ioring but block the function so it appears
    // as synchronous to the caller and that way you get the
    // best of both worlds, an easy to use fast async reader
    bool ReadBlocks(unsigned char* buffer,
                    unsigned int length,
                    const std::vector<IoBlock>& blocks) const
    {
        // validate open
        if (IsOpen() == false)
            return false;

        // validate buffer
        if (buffer == nullptr)
            return false;

        // validate blocks
        if (blocks.empty())
            return false;

        // validate block offsets and lengths, the total is kept
        // in 64 bits so that it cannot silently wrap around
        unsigned long long totalLength = 0;
        for (const auto& block : blocks)
        {
            if (block.offset % mBlockAlignment != 0)
                return false;

            if (block.length % mBlockAlignment != 0)
                return false;

            totalLength += block.length;
        }

        // validate buffer length
        if (length != totalLength)
            return false;

        // queue asynchronous block reads
        HRESULT hr = S_OK;
        unsigned char* scan0 = buffer;
        for (const auto& block : blocks)
        {
            hr = ::BuildIoRingReadFile(mIoRingHandle, ::IoRingHandleRefFromIndex(0), ::IoRingBufferRefFromPointer(scan0), block.length, block.offset, 0, IOSQE_FLAGS_NONE);
            if (hr != S_OK)
                return false;

            scan0 += block.length;
        }

        // submit ioring entries but wait until finished, behind
        // the scenes the read buffer will be filled as quickly
        // as possible in async kernel mode
        unsigned int submittedEntries = 0;
        hr = ::SubmitIoRing(mIoRingHandle,
                            (unsigned int) blocks.size(),
                            INFINITE,
                            &submittedEntries);
        if (hr != S_OK) return false;

        // drain the completion queue and check the result of
        // every operation, which also keeps the completion queue
        // from filling up across repeated calls
        bool succeeded = true;
        IORING_CQE cqe = {};
        while (::PopIoRingCompletion(mIoRingHandle, &cqe) == S_OK)
        {
            if (FAILED(cqe.ResultCode))
                succeeded = false;
        }

        return succeeded;
    }

    // check for open file handle
    bool IsOpen() const { return (mFileHandle != INVALID_HANDLE_VALUE); }

private:

    // disable unused functions to prevent resource issues.
    FileIoRing(const FileIoRing&) = delete;
    FileIoRing& operator=(const FileIoRing&) = delete;
    FileIoRing(FileIoRing&&) = delete;
    FileIoRing& operator=(FileIoRing&&) = delete;

    // member variables
    HANDLE mFileHandle;
    HIORING mIoRingHandle;
    unsigned int mBlockAlignment;
};

// application entry point

int main(int argc, char* argv[])
{
    // avoid compiler warning
    (void) argc; (void) argv;

    // example record size
    const unsigned int RecordSize = 4096;

    // load records to read - could be records at random locations
    // this example skips a record length for every record
    // these records could be random records in a database file
    // (the file is assumed to be large enough to hold them all)
    unsigned int bufferLength = 0;
    std::vector<FileIoRing::IoBlock> blocks;
    for (unsigned int i = 0; i < 200; i += 2)
    {
        FileIoRing::IoBlock block;
        block.offset = static_cast<unsigned long long>(i) * RecordSize;
        block.length = RecordSize;
        blocks.push_back(block);
        bufferLength += block.length;
    }

    // open file for read using IoRing behind the scenes
    FileIoRing fileIoRing;
    if (fileIoRing.Open("<file_path_to_open>") == false)
        return 1;

    // allocate page-aligned memory buffer
    unsigned char* buffer = static_cast<unsigned char*>(::VirtualAlloc(nullptr, bufferLength, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (buffer == nullptr) return 1;

    // read blocks
    if (fileIoRing.ReadBlocks(buffer, bufferLength, blocks) == false)
    {
        ::VirtualFree(buffer, 0, MEM_RELEASE);
        return 1;
    }

    // close file
    fileIoRing.Close();

    // do something with the data...
    // in this example all records returned
    // are packed into a single buffer
    unsigned char record[RecordSize] = {};
    unsigned char* scan0 = buffer;
    for (const auto& block : blocks)
    {
        // scan0 is pointing to the
        // beginning of your record
        memset(record, 0, RecordSize);
        memcpy(record, scan0, block.length);
        scan0 += block.length;
    }

    // free aligned memory block
    ::VirtualFree(buffer, 0, MEM_RELEASE);

    return 0;
}
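
The listing above hard-codes the queue sizes passed to CreateIoRing(). As a hedged alternative, the sketch below asks the system for its limits with QueryIoRingCapabilities() and clamps the requested sizes to them. QueueSizes, ClampQueueSizes and QueryQueueLimits are hypothetical names of my own, the Windows-specific part is compiled only when _WIN32 is defined, and I have not verified how CreateIoRing() itself reacts to oversized requests, so treat this as one possible approach rather than a requirement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#ifndef NOMINMAX
#define NOMINMAX // keep windows.h from defining min/max macros
#endif
#include <windows.h>
#include <ioringapi.h>
#endif

// Hypothetical helper type: a submission and completion queue size pair.
struct QueueSizes
{
    std::uint32_t submission;
    std::uint32_t completion;
};

// portable part: never request more than the reported limits
inline QueueSizes ClampQueueSizes(QueueSizes requested, QueueSizes limits)
{
    assert(limits.submission != 0 && limits.completion != 0);
    return { std::min(requested.submission, limits.submission),
             std::min(requested.completion, limits.completion) };
}

#ifdef _WIN32
// Windows-only part: read the limits reported by the I/O ring API
inline bool QueryQueueLimits(QueueSizes& limits)
{
    IORING_CAPABILITIES capabilities = {};
    if (FAILED(::QueryIoRingCapabilities(&capabilities)))
        return false;

    limits = { capabilities.MaxSubmissionQueueSize,
               capabilities.MaxCompletionQueueSize };
    return true;
}
#endif
```

A caller on Windows could call QueryQueueLimits() once, pass the result through ClampQueueSizes(), and feed the clamped values to CreateIoRing(). The same IORING_CAPABILITIES structure also reports a MaxVersion field that could be used in place of a fixed version constant.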

Special Thanks

I want to give a special thanks to Yarden Shafir and Alex Ionescu at Winsider for their invaluable work in providing essentially the only information available on the web about IoRingAPI. Their extensive research and clear explanations have been essential to understanding this new and powerful API. I would definitely encourage my readers to visit their website and read further about the IoRingAPI.

Looking Ahead to Part II

In the next article, I will take a much deeper look at how to use I/O Rings in a multi-threaded context. I will compare this model directly with the venerable IOCP, exploring the similarities and differences in their APIs, performance characteristics, and the underlying kernel mechanisms that make them so powerful. I will also begin my journey toward building a hybrid, high-performance I/O solution that leverages the unique strengths of both APIs.

Stay tuned.

What are your thoughts? Do you have any comments or corrections?