High Performance Async I/O: IOCP vs. I/O Rings - Part II: A Tale of Two Titans
Hello everyone,
In the previous article, I introduced the new I/O Rings API on Windows 11, exploring its core shared-memory architecture and its promise for high-performance device I/O. I also looked at a simple, single-threaded implementation. Now, I investigate a direct comparison between this "new kid on the block" and the long-standing champion of Windows asynchronous I/O, I/O Completion Ports (IOCP). This is not a contest to declare a single winner, but rather an exploration of two powerful tools and how they can be used most effectively.
Let's dive in.
The Champion: I/O Completion Ports (IOCP)
For over two decades, IOCP has been the go-to solution for building scalable, high-throughput I/O servers on Windows. Its elegance lies in its completion-based model. When you set up an IOCP, you associate multiple I/O handles (files, pipes, sockets, etc.) with a single completion port. When an asynchronous I/O operation on one of these handles completes, the operating system places a "completion packet" onto the queue of the IOCP.
A key feature of IOCP is its intelligent thread management. A pool of worker threads waits in GetQueuedCompletionStatus() and processes completion packets as they arrive. The kernel caps the number of those threads that run at once at the port's concurrency value (by default, the number of processors) and releases another waiting thread when a running one blocks, so there are just enough threads to keep the CPU busy with I/O-related work without the costly context switching that an excess of runnable threads would cause. This "one-to-many" model allows a small number of threads to handle thousands of concurrent I/O operations, making it ideal for web servers, game servers, and other high-concurrency network applications.
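To make the worker-pool pattern concrete, here is a minimal sketch in portable C++ rather than the Windows API itself. It assumes a hypothetical CompletionQueue class (a mutex and condition variable standing in for the completion port), a hypothetical Packet struct, and illustrative key values; Post() and Get() play roughly the roles of PostQueuedCompletionStatus() and GetQueuedCompletionStatus(). It does not model the kernel's concurrency limit.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a completion packet: a key plus a payload.
struct Packet {
    std::uint32_t key;    // 0 = shutdown, 1 = work (illustrative values)
    std::uint32_t bytes;  // pretend "bytes transferred"
};

// Minimal blocking FIFO that plays the role of the completion port.
class CompletionQueue {
public:
    // Analogous to posting a completion packet.
    void Post(const Packet& packet) {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mPackets.push(packet);
        }
        mReady.notify_one();
    }
    // Analogous to a worker blocking until a packet is available.
    Packet Get() {
        std::unique_lock<std::mutex> lock(mMutex);
        mReady.wait(lock, [this] { return !mPackets.empty(); });
        Packet packet = mPackets.front();
        mPackets.pop();
        return packet;
    }
private:
    std::mutex mMutex;
    std::condition_variable mReady;
    std::queue<Packet> mPackets;
};

// Posts numPackets work packets, drains them with numWorkers threads,
// and returns the total "bytes" the workers observed.
std::uint64_t DrainWithWorkers(std::uint32_t numPackets,
                               std::uint32_t numWorkers) {
    CompletionQueue queue;
    std::atomic<std::uint64_t> totalBytes{0};
    std::vector<std::thread> workers;
    for (std::uint32_t i = 0; i < numWorkers; ++i) {
        workers.emplace_back([&queue, &totalBytes] {
            for (;;) {
                Packet packet = queue.Get();
                if (packet.key == 0) break;   // shutdown packet
                totalBytes += packet.bytes;   // "process" the completion
            }
        });
    }
    for (std::uint32_t i = 0; i < numPackets; ++i)
        queue.Post(Packet{1, 512});
    for (std::uint32_t i = 0; i < numWorkers; ++i)
        queue.Post(Packet{0, 0});             // one shutdown packet per worker
    for (auto& worker : workers) worker.join();
    return totalBytes.load();
}
```

For example, DrainWithWorkers(1000, 4) posts 1,000 packets of 512 bytes each and returns 512000 once four workers have drained the queue. Posting one shutdown packet per worker is the same convention the sample application later in this article uses with its IoKeyShutdown key.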
The Challenger: I/O Rings
As I discussed, I/O Rings (or IoRingAPI on Windows) operates on a fundamentally different, ring-buffer model. Instead of relying on the kernel to queue completion packets, the application and the kernel share two ring buffers in memory: the submission queue (SQ) and the completion queue (CQ).
The application populates the SQ with requests and then submits them to the kernel in a single batch. The kernel processes these requests and places the results in the CQ. The application can then check the CQ for completions. This shared-memory, batched approach drastically reduces the number of system calls. While an IOCP-based system might require multiple system calls to set up and complete a single I/O operation, I/O Rings can handle hundreds or thousands of operations with just a few calls.
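The batching idea can be illustrated with a toy, single-threaded model in portable C++. Everything here is an assumption made for illustration: Ring, Request, Completion, and ToyIoRing are hypothetical names, the "kernel" is just a member function, and completions are produced in submission order, which a real I/O ring does not promise. What the sketch does show is the shape of the design: many requests are written into a ring, and one Submit() call hands the whole batch over.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-capacity ring. Capacity must be a power of two so the ever-increasing
// head/tail counters can be masked into slot indices.
template <typename T, std::size_t Capacity>
class Ring {
    static_assert(Capacity != 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    bool TryPush(const T& value) {
        if (mTail - mHead == Capacity) return false;  // ring is full
        mSlots[mTail & (Capacity - 1)] = value;
        ++mTail;
        return true;
    }
    bool TryPop(T& value) {
        if (mHead == mTail) return false;             // ring is empty
        value = mSlots[mHead & (Capacity - 1)];
        ++mHead;
        return true;
    }
    std::size_t Size() const { return mTail - mHead; }
private:
    std::array<T, Capacity> mSlots{};
    std::size_t mHead{0};  // next slot to read
    std::size_t mTail{0};  // next slot to write
};

struct Request    { std::uint64_t userData; std::uint32_t length; };
struct Completion { std::uint64_t userData; std::uint32_t bytesTransferred; };

// Toy ring pair: Queue() fills the submission ring, one Submit() call drains
// it and produces one completion per request, Pop() reads the results.
template <std::size_t Depth>
class ToyIoRing {
public:
    bool Queue(const Request& request) { return mSubmissions.TryPush(request); }
    // Returns the number of requests consumed by this single call.
    std::size_t Submit() {
        std::size_t consumed = 0;
        Request request{};
        while (mCompletions.Size() < Depth && mSubmissions.TryPop(request)) {
            mCompletions.TryPush(Completion{request.userData, request.length});
            ++consumed;
        }
        ++mSubmitCalls;
        return consumed;
    }
    bool Pop(Completion& completion) { return mCompletions.TryPop(completion); }
    std::size_t SubmitCalls() const { return mSubmitCalls; }
private:
    Ring<Request, Depth> mSubmissions;
    Ring<Completion, Depth> mCompletions;
    std::size_t mSubmitCalls{0};
};
```

Queuing eight requests on a ToyIoRing<8> and calling Submit() once yields eight completions while SubmitCalls() stays at 1; a ninth Queue() before the submit is rejected because the submission ring is full.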
A Head-to-Head Comparison
IOCP vs. I/O Rings
In simple terms, think of it this way: IOCP is like a highly efficient post office. You drop off your mail (I/O requests), and the post office (the kernel) handles all the sorting and delivery, notifying you when a letter arrives in your mailbox (the completion port). I/O Rings, on the other hand, is like a direct, high-speed conveyor belt between your desk and the factory (the kernel). You pile a bunch of orders (requests) onto the belt in one go, and the finished goods (completions) appear back on your side of the belt. The post office is great for general-purpose delivery, but the conveyor belt is hard to beat if you're moving a huge volume of goods between two specific points.
The Hybrid Model: A Powerful Synergy
So, does this mean I/O Rings will replace IOCP? Not at all. For general-purpose networking and most I/O workloads, IOCP remains the more robust, mature, and easier-to-manage solution due to its automatic thread management. It excels at handling many different types of I/O concurrently without developer intervention.
The true power of IoRingAPI lies in its ability to handle specialized, performance-critical tasks. A masterful I/O solution would use IOCP as its primary, general-purpose I/O engine while offloading specific, high-throughput device I/O tasks to a separate I/O Rings-based subsystem.
A perfect example of this is a tiled image viewer. The application needs to read large chunks of data (the image tiles) from a file as quickly as possible and then render these tiles in parallel to the window. The I/O portion (reading the tiles) is a high-throughput, latency-sensitive task. The rendering portion is a CPU-intensive, multi-threaded task.
A hybrid solution, as demonstrated in the C++ sample application below, leverages the strengths of both APIs:
- I/O Rings is used to efficiently read a batch of tiles from the file. All the I/O requests are submitted to the kernel in a single batch, minimizing the overhead of user-to-kernel mode switches.
- As each tile's read operation completes, the I/O Rings completion is then posted to a traditional IOCP. This is where IOCP's strength in thread management shines. A dedicated pool of worker threads, managed by the IOCP, picks up each completed tile and renders it to the window in parallel.
This hybrid approach ensures the I/O is handled with minimal overhead, while the rendering is executed concurrently across multiple CPU cores, resulting in a highly responsive and fast application.
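The fan-out-then-wait shape of this design can be sketched in portable C++ without any Windows calls. This is a model under stated assumptions, not the sample's implementation: CountdownLatch and RenderTilesBlocking are hypothetical names, worker threads claim tile indices from a shared atomic counter instead of an IOCP, and a single countdown stands in for per-tile completion events. The caller hands the tiles to several threads and blocks until every tile has been processed.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: Wait() blocks until CountDown() has been called
// `count` times.
class CountdownLatch {
public:
    explicit CountdownLatch(std::size_t count) : mRemaining(count) {}
    void CountDown() {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mRemaining > 0 && --mRemaining == 0)
            mDone.notify_all();
    }
    void Wait() {
        std::unique_lock<std::mutex> lock(mMutex);
        mDone.wait(lock, [this] { return mRemaining == 0; });
    }
private:
    std::mutex mMutex;
    std::condition_variable mDone;
    std::size_t mRemaining;
};

// Processes numTiles tiles on numWorkers threads and returns only after all
// of them are finished, so the call looks synchronous to the caller.
std::size_t RenderTilesBlocking(std::size_t numTiles, std::size_t numWorkers) {
    if (numWorkers == 0) numWorkers = 1;        // need at least one worker
    CountdownLatch latch(numTiles);
    std::atomic<std::size_t> nextTile{0};
    std::atomic<std::size_t> rendered{0};
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                const std::size_t tile = nextTile.fetch_add(1);
                if (tile >= numTiles) break;    // no tiles left to claim
                ++rendered;                     // stand-in for rendering the tile
                latch.CountDown();              // report this tile as finished
            }
        });
    }
    latch.Wait();                               // block until every tile is done
    for (auto& worker : workers) worker.join();
    return rendered.load();
}
```

Calling RenderTilesBlocking(64, 4) returns 64 only after all 64 tiles have been counted down. A single latch is one possible alternative to creating an event per tile; which is preferable depends on whether the caller needs per-tile status.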
Asynchronous vs. Raw Speed: A Clarification
A common misconception in high-performance computing is that asynchronous I/O is inherently about making a single operation faster. In reality, that's not its primary goal. The true power of asynchronous I/O lies in its ability to enable scalability. By not blocking a thread while waiting for an I/O operation to complete, a single thread can initiate multiple operations and then perform other useful work. This allows an application to handle a far greater number of concurrent tasks with a limited pool of threads, making the entire system more efficient.
In the context of IOCP vs. I/O Rings, this distinction becomes clear:
- IOCP is a scalability champion. It's a brilliant abstraction that excels at managing a large number of concurrent I/O operations. It's designed to efficiently handle thousands of network connections or file operations by dynamically adjusting the number of threads actively processing completions. It ensures your application doesn't get bogged down in thread management overhead, making it highly scalable and parallel.
- I/O Rings is a raw speed champion. Its unique shared-memory, ring-buffer design and batching capabilities are all about reducing the per-operation overhead. By minimizing costly user-to-kernel mode switches and reducing syscalls, it offers superior raw throughput for device I/O. It doesn't provide the high-level thread management of IOCP; instead, it provides a high-speed pipe for moving data as fast as possible.
Therefore, the statement "IOCP is scalable, parallel processing where IoRings are raw speed" is a very accurate and concise way to describe their fundamental differences. One is a master of efficient, concurrent management, and the other is a master of brute-force I/O throughput. The hybrid model I demonstrated earlier leverages both strengths to achieve a solution that is both highly scalable and incredibly fast.
The sample application demonstrates a hybrid, high-performance solution for a tiled image viewer. A well-written production viewer would avoid rendering from disk for every WM_PAINT message, but for this example we want to crush the hard drive with read requests. It uses I/O Rings to efficiently read a batch of image tiles from a file, and then leverages IOCP to dispatch the rendering of those tiles to a pool of worker threads for parallel processing. The overall effect is that the application can read and display a batch of random tiles from a large image on disk with great speed and responsiveness.
Sample Code: A Hybrid Tiled Image Renderer
#include <windows.h>
#include <ioringapi.h>
#include <string>
#include <vector>
#include <thread>
#include <iostream>
#include <fstream>
#include <cmath>
#include <cstdlib>
#include <cstring>
#include <ctime>
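#pragma comment(lib, "onecoreuap.lib")
// note: the narrow-string Win32 calls below (CreateFile, RegisterClassEx,
// CreateWindowEx) assume a build without UNICODE defined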
// simple class to pass to the completion queue
// when using iocp that holds a buffer offset
// length, iocp key, and completion event
class IoTile :
public OVERLAPPED
{
public:
enum IoKey
{
IoKeyNone,
IoKeyIo,
IoKeyRead,
IoKeyWrite,
IoKeyShutdown,
};
IoTile(void)
{
hEvent = NULL;
Internal = 0;
InternalHigh = 0;
Offset = 0;
OffsetHigh = 0;
Pointer = NULL;
mPixelX = 0;
mPixelY = 0;
mUserData = NULL;
mIoKey = IoKeyNone;
mNumBytesTransferred = 0;
mCompletionEvent = ::CreateEvent(NULL, TRUE, FALSE, NULL);
::ResetEvent(mCompletionEvent);
}
virtual ~IoTile()
{
if (mCompletionEvent != NULL)
{
::CloseHandle(mCompletionEvent);
mCompletionEvent = NULL;
}
}
INT64 GetOffset() const
{
LARGE_INTEGER li = {0};
li.LowPart = Offset;
li.HighPart = OffsetHigh;
return li.QuadPart;
}
void SetOffset(const INT64& offset)
{
LARGE_INTEGER li = {0};
li.QuadPart = offset;
Offset = li.LowPart;
OffsetHigh = li.HighPart;
}
void GetOffset(LARGE_INTEGER& offset) const
{
memset(&offset, 0, sizeof(LARGE_INTEGER));
offset.LowPart = Offset;
offset.HighPart = OffsetHigh;
}
void SetOffset(const LARGE_INTEGER& offset)
{
Offset = offset.LowPart;
OffsetHigh = offset.HighPart;
}
DWORD WaitCompleted(const DWORD milliseconds = INFINITE)
{
return ::WaitForSingleObject(mCompletionEvent, milliseconds);
}
void SetNumBytesTransferred(DWORD numBytesTransferred)
{
mNumBytesTransferred = numBytesTransferred;
}
void AddNumBytesTransferred(DWORD numBytesTransferred)
{
mNumBytesTransferred += numBytesTransferred;
}
UINT GetPixelX() const {return mPixelX;}
void SetPixelX(const UINT& pixelX) {mPixelX = pixelX;}
UINT GetPixelY() const {return mPixelY;}
void SetPixelY(const UINT& pixelY) {mPixelY = pixelY;}
void* GetUserData() const {return mUserData;}
void SetUserData(void* userData) {mUserData = userData;}
DWORD GetNumBytesTransferred() const {return mNumBytesTransferred;}
IoKey GetIoKey() const {return mIoKey;}
void SetIoKey(const IoKey& ioKey) {mIoKey = ioKey;}
BOOL SetCompleted() {return ::SetEvent(mCompletionEvent);}
BOOL ResetCompleted() {return ::ResetEvent(mCompletionEvent);}
operator HANDLE() const {return mCompletionEvent;}
private:
UINT mPixelX;
UINT mPixelY;
void* mUserData;
DWORD mNumBytesTransferred;
HANDLE mCompletionEvent;
IoKey mIoKey;
};
// simple class to hold aligned memory buffer
class MemoryAligned
{
public:
MemoryAligned() {}
virtual ~MemoryAligned(void) { Free(); }
HANDLE Alloc(const SIZE_T& length)
{
if (length == 0)
return nullptr;
Free();
mRequestedLength = length;
SIZE_T numChunks = (SIZE_T) ceil(((double) mRequestedLength)
/ ((double) 0x10000));
mAlignedLength = numChunks * 0x10000;
mHandle = ::VirtualAlloc(nullptr,
mAlignedLength,
MEM_COMMIT,
PAGE_READWRITE);
if (mHandle == nullptr) return nullptr;
::memset(mHandle, 0, mAlignedLength);
return mHandle;
}
void Free()
{
if (mHandle != nullptr)
::VirtualFree(mHandle, 0, MEM_RELEASE);
mHandle = nullptr;
mRequestedLength = 0;
mAlignedLength = 0;
}
SIZE_T GetLength() const {return mRequestedLength;}
operator void*() const {return mHandle;}
operator BYTE*() const {return (BYTE*) mHandle;}
private:
SIZE_T mRequestedLength{0};
SIZE_T mAlignedLength{0};
void* mHandle{nullptr};
};
// simple class to read image tiles using windows ioringapi
// and iocp. allows a user to read pixels in an asynchronous
// manner and uses IoRing to optimize I/O operations by
// minimizing kernel-mode switches and associated overhead
// and IO completion ports to synchronize thread communication
// and render tiles to the window in parallel
class FileIoRing
{
public:
// default constructor
FileIoRing(void) :
mHwnd(nullptr),
mFileHandle(INVALID_HANDLE_VALUE),
mIoRingHandle(nullptr),
mCompletionEvent(nullptr),
mIocpCompletionHandle(nullptr),
mIocpWorkerHandle(nullptr),
mNumTilesX(0),
mNumTilesY(0),
mTileSize(64),
mNumChannels(3),
mBitsPerChannel(8),
mNumLogicalCores(0) {}
// virtual destructor
virtual ~FileIoRing(void)
{
// close
Close();
}
// simple open function for existing tiled image file
bool Open(const char* filePath,
HWND hwnd,
const UINT64& numTilesX,
const UINT64& numTilesY,
const UINT& tileSize = 64)
{
// close existing
Close();
// validate parameters
if (filePath == nullptr ||
hwnd == nullptr ||
numTilesX == 0 ||
numTilesY == 0 ||
tileSize == 0 ||
tileSize % 2 != 0)
return false;
// set tile dimensions
mHwnd = hwnd;
mNumTilesX = numTilesX;
mNumTilesY = numTilesY;
mTileSize = tileSize;
// open file as overlapped for asynchronous usage
// and disable operating system buffering
mFileHandle = ::CreateFile(filePath,
GENERIC_READ,
FILE_SHARE_READ,
nullptr,
OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL |
FILE_FLAG_OVERLAPPED |
FILE_FLAG_NO_BUFFERING,
nullptr);
if (mFileHandle == INVALID_HANDLE_VALUE)
{
Close();
return false;
}
// get the system's capabilities.
IORING_CAPABILITIES capabilities = {};
HRESULT hr = ::QueryIoRingCapabilities(&capabilities);
if (hr != S_OK)
{
Close();
return false;
}
// check for emulation mode
if (capabilities.FeatureFlags &
IORING_FEATURE_UM_EMULATION)
{
Close();
return false;
}
// check for completion event
if (!(capabilities.FeatureFlags &
IORING_FEATURE_SET_COMPLETION_EVENT))
{
Close();
return false;
}
// ioring flags
IORING_CREATE_FLAGS flags = {};
flags.Required = IORING_CREATE_REQUIRED_FLAGS_NONE;
flags.Advisory = IORING_CREATE_ADVISORY_FLAGS_NONE;
// create ioring handle and set the queues to their max
hr = ::CreateIoRing(capabilities.MaxVersion,
flags,
capabilities.MaxSubmissionQueueSize,
capabilities.MaxCompletionQueueSize,
&mIoRingHandle);
if (hr != S_OK)
{
Close();
return false;
}
// register file handle
hr = ::BuildIoRingRegisterFileHandles(mIoRingHandle,
1,
&mFileHandle,
0);
if (hr != S_OK)
{
Close();
return false;
}
// create completion event
mCompletionEvent = ::CreateEvent(nullptr,
FALSE,
FALSE,
nullptr);
if (mCompletionEvent == nullptr)
{
Close();
return false;
}
// set completion event
hr = ::SetIoRingCompletionEvent(mIoRingHandle,
mCompletionEvent);
if (FAILED(hr))
{
Close();
return false;
}
// create I/O completion port for the completion thread communication
mIocpCompletionHandle = ::CreateIoCompletionPort(INVALID_HANDLE_VALUE,
nullptr,
0,
1);
if (mIocpCompletionHandle == nullptr)
{
Close();
return false;
}
// get number of logical cores, but needs 2 threads minimum
mNumLogicalCores = std::thread::hardware_concurrency();
if (mNumLogicalCores < 2) mNumLogicalCores = 2;
// create I/O completion port for the worker thread communication
mIocpWorkerHandle = ::CreateIoCompletionPort(INVALID_HANDLE_VALUE,
nullptr,
0,
mNumLogicalCores - 1);
if (mIocpWorkerHandle == nullptr)
{
Close();
return false;
}
// create completion thread
mCompletionThread = std::thread(&FileIoRing::CompletionThread,
this);
// create worker threads
mWorkerThreads.resize(mNumLogicalCores - 1);
for (UINT i = 0; i < mNumLogicalCores - 1; ++i)
mWorkerThreads[i] = std::thread(&FileIoRing::WorkerThread,
this);
return true;
}
// close file
void Close()
{
// post shutdown key to completion thread
if (mIocpCompletionHandle != nullptr)
::PostQueuedCompletionStatus(mIocpCompletionHandle,
0,
IoTile::IoKeyShutdown,
nullptr);
// signal completion event so the completion thread wakes up
if (mCompletionEvent != nullptr)
::SetEvent(mCompletionEvent);
// wait for completion thread to exit
if (mCompletionThread.joinable())
mCompletionThread.join();
// post one shutdown key per worker thread that was started
if (mIocpWorkerHandle != nullptr)
{
for (size_t i = 0; i < mWorkerThreads.size(); ++i)
::PostQueuedCompletionStatus(mIocpWorkerHandle,
0,
IoTile::IoKeyShutdown,
nullptr);
}
// wait for worker threads to exit
for (auto& thread : mWorkerThreads)
{
if (thread.joinable())
thread.join();
}
// clear worker threads
mWorkerThreads.clear();
// close iocp completion handle
if (mIocpCompletionHandle != nullptr)
{
::CloseHandle(mIocpCompletionHandle);
mIocpCompletionHandle = nullptr;
}
// close iocp worker handle
if (mIocpWorkerHandle != nullptr)
{
::CloseHandle(mIocpWorkerHandle);
mIocpWorkerHandle = nullptr;
}
// close completion event handle
if (mCompletionEvent != nullptr)
{
::CloseHandle(mCompletionEvent);
mCompletionEvent = nullptr;
}
// close ioring handle
if (mIoRingHandle != nullptr)
{
::CloseIoRing(mIoRingHandle);
mIoRingHandle = nullptr;
}
// close file handle
if (mFileHandle != INVALID_HANDLE_VALUE)
{
::CloseHandle(mFileHandle);
mFileHandle = INVALID_HANDLE_VALUE;
}
// reset tile dimensions
mHwnd = nullptr;
mNumTilesX = 0;
mNumTilesY = 0;
mTileSize = 64;
mNumChannels = 3;
mBitsPerChannel = 8;
}
// read tiles asynchronously using ioring and iocp
// internally, but block the function so it appears
// synchronous to the caller. that way you get the
// best of both worlds, an easy to use fast async reader
// that reads and renders tiles as fast as possible
bool ReadTiles(const UINT64& tileX,
const UINT64& tileY,
const UINT& numTilesX,
const UINT& numTilesY,
MemoryAligned& memoryAligned) const
{
// validate open
if (IsOpen() == false)
return false;
// validate tile parameters
if (tileX + numTilesX > mNumTilesX ||
tileY + numTilesY > mNumTilesY)
return false;
// num tiles
SIZE_T numTiles = (SIZE_T) (numTilesX * numTilesY);
if (numTiles == 0)
return false;
// compute tile length
SIZE_T tileLength = (SIZE_T) (mTileSize *
mTileSize *
mNumChannels *
(mBitsPerChannel / 8));
if (tileLength == 0)
return false;
// compute memory buffer length
SIZE_T bufferLength = (SIZE_T) (numTiles * tileLength);
if (bufferLength == 0)
return false;
// validate buffer length
if (memoryAligned.GetLength() < bufferLength)
return false;
memset(memoryAligned, 0, memoryAligned.GetLength());
// create io tiles
IoTile** ioTiles = new IoTile*[numTiles];
for (SIZE_T i = 0; i < numTiles; ++i)
ioTiles[i] = new IoTile;
// queue asynchronous tile reads and count how many were queued
HRESULT hr = S_OK;
BYTE* scan0 = memoryAligned;
SIZE_T numQueued = 0;
for (UINT64 y = tileY; y < (tileY + numTilesY) && hr == S_OK; ++y)
{
for (UINT64 x = tileX; x < (tileX + numTilesX); ++x)
{
IoTile* ioTile = ioTiles[numQueued];
INT64 fileOffset = (INT64) ((y *
mNumTilesX *
tileLength) +
(x * tileLength));
ioTile->SetPixelX((UINT) (x * mTileSize));
ioTile->SetPixelY((UINT) (y * mTileSize));
ioTile->SetUserData(scan0);
ioTile->SetOffset(fileOffset);
ioTile->SetNumBytesTransferred((DWORD) tileLength);
ioTile->SetIoKey(IoTile::IoKeyRead);
hr = ::BuildIoRingReadFile(mIoRingHandle,
::IoRingHandleRefFromIndex(0),
::IoRingBufferRefFromPointer(scan0),
(UINT32) tileLength,
fileOffset,
(UINT_PTR) ioTile,
IOSQE_FLAGS_NONE);
// stop queuing on the first failure; tiles that were
// never queued will never complete, so they must not
// be waited on below
if (hr != S_OK)
break;
++numQueued;
scan0 += tileLength;
}
}
// submit ioring entries and then wait until finished;
// behind the scenes the read buffer is filled as
// quickly as possible by the kernel and the tiles are
// then rendered in parallel using multiple threads
// simultaneously
UINT submittedEntries = 0;
HRESULT submitHr = ::SubmitIoRing(mIoRingHandle,
0,
0,
&submittedEntries);
// wait for every queued iotile to complete, but only when
// the submit succeeded, otherwise nothing will signal them
bool completed = (submitHr == S_OK);
for (SIZE_T i = 0; i < numQueued && submitHr == S_OK; ++i)
{
if (ioTiles[i]->WaitCompleted() != WAIT_OBJECT_0)
completed = false;
}
// delete iotiles
for (SIZE_T i = 0; i < numTiles; ++i)
delete ioTiles[i];
delete [] ioTiles;
// succeed only if every tile was queued, submitted, and completed
return (hr == S_OK) && completed;
}
// check for open file handle
bool IsOpen() const {return (mFileHandle != INVALID_HANDLE_VALUE);}
private:
// completion thread
void CompletionThread() const
{
// wait for the completion event to be signaled.
::WaitForSingleObject(mCompletionEvent, INFINITE);
// process tiles until shutdown
for (;;)
{
ULONG_PTR completionKey = 0;
OVERLAPPED* overlapped = NULL;
DWORD numberOfBytesTransferred = 0;
// pull a completion key from the completion port if one
// is available but don't wait if not
::GetQueuedCompletionStatus(mIocpCompletionHandle,
&numberOfBytesTransferred,
&completionKey,
&overlapped,
0);
// check key and exit thread if it is a shutdown key
IoTile::IoKey ioKey = (IoTile::IoKey) completionKey;
if (ioKey == IoTile::IoKeyShutdown)
break;
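// pop completion packet; when the ring is empty, sleep until
// the auto-reset completion event is signaled again
IORING_CQE cqe = {};
HRESULT hr = ::PopIoRingCompletion(mIoRingHandle, &cqe);
if (hr != S_OK)
{
::WaitForSingleObject(mCompletionEvent, INFINITE);
continue;
}
// the UserData holds the IoTile* for this read; entries queued
// without user data (such as the handle registration) are skipped
IoTile* ioTile = (IoTile*) cqe.UserData;
if (ioTile == nullptr)
continue;
// a failed read is posted with IoKeyNone so the worker skips
// rendering but still signals the tile as completed
IoTile::IoKey postKey = FAILED(cqe.ResultCode) ?
IoTile::IoKeyNone :
ioTile->GetIoKey();
// post the completed IoTile to the worker completion port so
// an iocp worker thread will pick it up and render it
::PostQueuedCompletionStatus(mIocpWorkerHandle,
ioTile->GetNumBytesTransferred(),
postKey,
ioTile);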
}
}
// worker thread
void WorkerThread() const
{
// create device context and memory bitmap
HDC sourceDC = ::GetDC(mHwnd);
HDC memoryDC = ::CreateCompatibleDC(sourceDC);
UINT numBytes = mTileSize *
mTileSize *
mNumChannels *
(mBitsPerChannel / 8);
BITMAPINFO bmi = {};
bmi.bmiHeader.biSize = sizeof(BITMAPINFOHEADER);
bmi.bmiHeader.biWidth = (LONG) mTileSize;
bmi.bmiHeader.biHeight = (LONG) mTileSize * -1;
bmi.bmiHeader.biPlanes = 1;
bmi.bmiHeader.biBitCount = mNumChannels * 8;
bmi.bmiHeader.biCompression = BI_RGB;
bmi.bmiHeader.biSizeImage = (DWORD) numBytes;
bmi.bmiHeader.biXPelsPerMeter = 0;
bmi.bmiHeader.biYPelsPerMeter = 0;
bmi.bmiHeader.biClrUsed = 0;
bmi.bmiHeader.biClrImportant = 0;
HBITMAP memoryBitmap = ::CreateCompatibleBitmap(sourceDC,
mTileSize,
mTileSize);
HBITMAP oldMemoryBitmap = (HBITMAP) ::SelectObject(memoryDC,
memoryBitmap);
// process tiles until shutdown
for (;;)
{
ULONG_PTR completionKey = 0;
OVERLAPPED* overlapped = NULL;
DWORD numberOfBytesTransferred = 0;
// pull a completion key from the completion port if one
// is available and wait if not
::GetQueuedCompletionStatus(mIocpWorkerHandle,
&numberOfBytesTransferred,
&completionKey,
&overlapped,
INFINITE);
// check key and exit thread if it is a shutdown key
IoTile::IoKey ioKey = (IoTile::IoKey) completionKey;
if (ioKey == IoTile::IoKeyShutdown)
break;
// cast to iotile
IoTile* ioTile = (IoTile*) overlapped;
if (ioTile == nullptr)
continue;
// read
if (ioKey == IoTile::IoKeyRead)
{
// only 8-bit RGB or BGR rendering for this example
// this would be a good place to scale the color space
// to 8-bit if your image was higher bit depth but that
// is beyond the scope of this example since it would
// require computing pixel statistics at a minimum and
// possibly a full on histogram depending on your algorithm
::SetDIBits(memoryDC,
memoryBitmap,
0,
mTileSize,
ioTile->GetUserData(),
&bmi,
DIB_RGB_COLORS);
::BitBlt(sourceDC,
ioTile->GetPixelX(),
ioTile->GetPixelY(),
mTileSize,
mTileSize,
memoryDC,
0,
0,
SRCCOPY);
}
// signal the ReadTiles function, which blocks until all the IoTiles
// have been rendered, that we are finished with this IoTile
ioTile->SetCompleted();
}
// cleanup
memoryBitmap = (HBITMAP) ::SelectObject(memoryDC, oldMemoryBitmap);
::DeleteObject(memoryBitmap);
::DeleteDC(memoryDC);
::ReleaseDC(mHwnd, sourceDC);
}
// disable unused functions to prevent resource issues.
FileIoRing(const FileIoRing&) = delete;
FileIoRing& operator=(const FileIoRing&) = delete;
FileIoRing(FileIoRing&&) = delete;
FileIoRing& operator=(FileIoRing&&) = delete;
// member variables
HWND mHwnd;
HANDLE mFileHandle;
HIORING mIoRingHandle;
HANDLE mIocpCompletionHandle;
HANDLE mIocpWorkerHandle;
HANDLE mCompletionEvent;
UINT64 mNumTilesX;
UINT64 mNumTilesY;
UINT mTileSize;
UINT mNumChannels;
UINT mBitsPerChannel;
UINT mNumLogicalCores;
std::thread mCompletionThread;
std::vector< std::thread> mWorkerThreads;
};
// simple class to render tiles from a file
// using ioringapi and iocp to demonstrate
// how both technologies can be combined to
// read data from disk and then render pixels
// in a multi-threaded, asynchronous way
class IoWindow
{
public:
IoWindow(HINSTANCE instance) :
mBrush(nullptr),
mHwnd(nullptr),
mInstance(instance),
mClassName("IoWindowClass"),
mNumTilesX(8),
mNumTilesY(8),
mTileSize(64),
mNumChannels(3),
mBitsPerChannel(8) {}
~IoWindow()
{
mFileIoRing.Close();
DeleteObject(mBrush);
UnregisterClass(mClassName, mInstance);
}
bool Create(const char* filepath,
const UINT windowWidth,
const UINT windowHeight,
const UINT tileSize,
const UINT numTilesX,
const UINT numTilesY)
{
// initialize
mTileSize = tileSize;
mNumTilesX = numTilesX;
mNumTilesY = numTilesY;
// define the window class
WNDCLASSEX wc = {};
wc.cbSize = sizeof(WNDCLASSEX);
wc.lpfnWndProc = IoWindow::WindowProc;
wc.hInstance = mInstance;
wc.lpszClassName = mClassName;
// register the window
if (!::RegisterClassEx(&wc))
return false;
// create the window
mHwnd = ::CreateWindowEx(0,
mClassName,
"IoWindow",
WS_OVERLAPPEDWINDOW,
CW_USEDEFAULT,
CW_USEDEFAULT,
windowWidth,
windowHeight,
nullptr,
nullptr,
mInstance,
this);
if (mHwnd == nullptr)
return false;
// create the background brush to clear the window
mBrush = ::CreateSolidBrush(RGB(255, 255, 255));
// num tiles
SIZE_T numTiles = (SIZE_T) (mNumTilesX * mNumTilesY);
if (numTiles == 0) return false;
// compute tile length
SIZE_T tileLength = (SIZE_T) (mTileSize *
mTileSize *
mNumChannels *
(mBitsPerChannel / 8));
if (tileLength == 0) return false;
// compute memory buffer length
SIZE_T bufferLength = (SIZE_T) (numTiles * tileLength);
if (bufferLength == 0) return false;
// allocate aligned memory buffer
if (mMemoryAligned.Alloc(bufferLength) == nullptr)
return false;
// open the io ring file and fail if the ring could not be set up
if (mFileIoRing.Open(filepath,
mHwnd,
mNumTilesX,
mNumTilesY,
mTileSize) == false)
return false;
// show and update the window.
::ShowWindow(mHwnd, SW_SHOWDEFAULT);
::UpdateWindow(mHwnd);
return true;
}
// the main message loop for the window
void Run()
{
MSG msg = {};
while (::GetMessage(&msg, nullptr, 0, 0))
{
::TranslateMessage(&msg);
::DispatchMessage(&msg);
}
}
private:
// WindowProc
static LRESULT CALLBACK WindowProc(HWND hWnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
{
IoWindow* ioWindow = nullptr;
if (uMsg == WM_NCCREATE)
{
CREATESTRUCT* pCreate = reinterpret_cast< CREATESTRUCT*>(lParam);
ioWindow = reinterpret_cast< IoWindow*>(pCreate->lpCreateParams);
::SetWindowLongPtr(hWnd, GWLP_USERDATA, reinterpret_cast< LONG_PTR>(ioWindow));
}
else
ioWindow = reinterpret_cast< IoWindow*>(::GetWindowLongPtr(hWnd, GWLP_USERDATA));
if (ioWindow)
return ioWindow->HandleMessage(hWnd, uMsg, wParam, lParam);
return ::DefWindowProc(hWnd, uMsg, wParam, lParam);
}
// HandleMessage
LRESULT HandleMessage(HWND hWnd,
UINT uMsg,
WPARAM wParam,
LPARAM lParam)
{
switch (uMsg)
{
case WM_PAINT:
{
PAINTSTRUCT ps;
HDC hdc = ::BeginPaint(hWnd, &ps);
RECT rect = ps.rcPaint;
::FillRect(hdc, &rect, mBrush);
// paint gets called every time the window
// is resized and frankly it's called a
// ton. normally you would create a background
// bitmap and only render from the file when
// needed (resize), but the reason I am re-rendering
// every time for this example is to show off
// the IO speed. Resize the window rapidly
// and watch the IoRing API crush your drive.
mFileIoRing.ReadTiles(0,
0,
mNumTilesX,
mNumTilesY,
mMemoryAligned);
::EndPaint(hWnd, &ps);
return 0;
}
case WM_DESTROY:
{
::PostQuitMessage(0);
return 0;
}
default:
return ::DefWindowProc(hWnd,
uMsg,
wParam,
lParam);
}
}
HBRUSH mBrush;
UINT mTileSize;
UINT mNumTilesX;
UINT mNumTilesY;
UINT mNumChannels;
UINT mBitsPerChannel;
MemoryAligned mMemoryAligned;
FileIoRing mFileIoRing;
HWND mHwnd;
HINSTANCE mInstance;
const char* mClassName;
};
// simple class to create a file that is
// filled with RGB tiles to demonstrate
// the IO rendering in an asynchronous manner
// there is no header, just tiles of pixels
class ImageFile
{
public:
// create a simple tiled image file
bool Create(const std::string& filePath,
int tileSize,
int numTilesX,
int numTilesY)
{
std::ofstream file(filePath,
std::ios::out | std::ios::binary);
if (file.is_open() == false) return false;
srand(static_cast< UINT>(time(0)));
const int pixelsPerTile = tileSize * tileSize;
const int bytesPerPixel = mChannels;
const int bytesPerTile = pixelsPerTile * bytesPerPixel;
const int totalTiles = numTilesX * numTilesY;
// create tiles with random colors
std::vector< BYTE> pixels(bytesPerTile);
for (int i = 0; i < totalTiles; ++i)
{
unsigned char red = rand() % 256;
unsigned char green = rand() % 256;
unsigned char blue = rand() % 256;
for (int j = 0; j < pixelsPerTile; ++j)
{
pixels[j * bytesPerPixel + 0] = red;
pixels[j * bytesPerPixel + 1] = green;
pixels[j * bytesPerPixel + 2] = blue;
}
file.write(reinterpret_cast< const char*>(pixels.data()),
bytesPerTile);
}
file.flush();
file.close();
return true;
}
private:
int mChannels{3};
};
// application entry point
int main(int argc, char* argv[])
{
// define the image parameters
const std::string filePath = "c:\\rgb_tiles.ioring";
const int tileSize = 64;
const int numTilesX = 8;
const int numTilesY = 8;
// create the image file that is filled with
// RGB tiles and are colored with random colors
ImageFile imageFile;
if (imageFile.Create(filePath,
tileSize,
numTilesX,
numTilesY) == false)
return 1;
// get console window instance
HINSTANCE instance = (HINSTANCE) ::GetModuleHandle(nullptr);
if (instance == nullptr) return 1;
// create the io window
IoWindow ioWindow(instance);
if (ioWindow.Create(filePath.c_str(),
800,
600,
tileSize,
numTilesX,
numTilesY) == false)
return 1;
// minimize the console window
::ShowWindow(::GetConsoleWindow(), SW_MINIMIZE);
// run the io window
ioWindow.Run();
return 0;
}
Overview of the Code Example
The provided C++ code demonstrates a hybrid, high-performance solution for a tiled image viewer. It uses two separate, but connected, asynchronous I/O models to achieve both high throughput and parallel processing.
- File Creation (ImageFile class): The main function starts by creating a mock image file (rgb_tiles.ioring). This file consists of a simple grid of tiles, each filled with a solid, random RGB color. There is no header; the data is a continuous stream of pixels.
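- The Hybrid Engine (FileIoRing class): This is the core of the I/O engine. It contains two main components: an I/O ring that batches the tile reads and is drained by a single completion thread (CompletionThread), and an IOCP serviced by a pool of worker threads (WorkerThread) that render each completed tile.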
- The Application Window (IoWindow class): This class manages the main window. In its WM_PAINT message handler, it calls the FileIoRing::ReadTiles function. The key takeaway here is that although ReadTiles appears synchronous to the caller (it blocks until all tiles are rendered), the actual work is happening asynchronously in the background.
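The headerless, row-by-row layout described for ImageFile makes tile addressing simple arithmetic, sketched below. TileLayout, TileLength, TileOffset, and IsSectorAligned are hypothetical helper names introduced for illustration, and the sector sizes mentioned afterwards are assumed values (the real one should be queried from the volume). The alignment check is included because the sample opens its file with FILE_FLAG_NO_BUFFERING, which requires read offsets and lengths to be multiples of the sector size.

```cpp
#include <cassert>
#include <cstdint>

// Describes the headerless tile file: tiles are stored row by row, each tile
// as tileSize x tileSize pixels.
struct TileLayout {
    std::uint64_t numTilesX;       // tiles per row
    std::uint64_t numTilesY;       // tiles per column
    std::uint32_t tileSize;        // pixels per tile edge
    std::uint32_t channels;        // e.g. 3 for RGB
    std::uint32_t bitsPerChannel;  // e.g. 8
};

// Bytes occupied by one tile.
constexpr std::uint64_t TileLength(const TileLayout& layout) {
    return static_cast<std::uint64_t>(layout.tileSize) * layout.tileSize *
           layout.channels * (layout.bitsPerChannel / 8);
}

// Byte offset of tile (x, y) from the start of the file.
constexpr std::uint64_t TileOffset(const TileLayout& layout,
                                   std::uint64_t x,
                                   std::uint64_t y) {
    return (y * layout.numTilesX + x) * TileLength(layout);
}

// True when both offset and length are multiples of the sector size, which
// unbuffered reads require.
constexpr bool IsSectorAligned(std::uint64_t offset,
                               std::uint64_t length,
                               std::uint32_t sectorSize) {
    return sectorSize != 0 &&
           offset % sectorSize == 0 &&
           length % sectorSize == 0;
}
```

With the sample's defaults (64-pixel tiles, 3 channels, 8 bits per channel, 8 tiles per row), a tile occupies 12,288 bytes and tile (3, 2) starts at byte 233,472. Both values are multiples of 512 and of 4,096, so the reads satisfy the alignment rule for either assumed sector size; a 10-pixel tile (300 bytes) would not.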
Application Screenshot - Tiles Rendered with IOCP and I/O Rings
Example Code - Asynchronous Tiled Image Renderer
A Master Class in Fast, Scalable, Asynchronous I/O
This code is a master class because it intelligently partitions the problem to leverage the unique strengths of each I/O model.
- I/O Rings for Raw Speed: Reading from a local file is a device I/O operation. By using I/O Rings, the application minimizes the overhead of system calls and context switching. It can submit a large batch of read requests for all the visible tiles at once, which the kernel can then execute with maximum efficiency. This ensures the data is pulled from the disk as fast as the hardware allows. This is the "fast" part of the equation.
- IOCP for Scalable, Parallel Processing: Once the data for a tile is in memory, the next step is to render it. This is a CPU-bound task that can be parallelized. Instead of having the single I/O completion thread perform the rendering, it dispatches the work to a pool of worker threads using IOCP. IOCP's ability to efficiently manage a pool of threads ensures that rendering tasks are processed in parallel across all available CPU cores. This is the "scalable, parallel" part of the equation.
By combining these two models, the application achieves a powerful synergy: the I/O-intensive work is optimized for raw speed via I/O Rings, while the CPU-intensive work is optimized for parallel scalability via IOCP. This demonstrates that for a complex, real-world application, the optimal solution is not about choosing one API over the other, but about understanding their unique strengths and combining them to build a superior, high-performance system.
Hybrid I/O for High-Performance Video Rendering
The same hybrid strategy demonstrated for the tiled image viewer can be applied to build a high-performance video renderer. A video player is essentially a complex I/O and processing pipeline that must operate in real-time to avoid glitches.
- High-Speed I/O with I/O Rings: The core task of reading raw video packets, audio packets, closed captioning data, and other metadata (such as KLV packets) from a single video file can be offloaded to an IoRingAPI instance. This allows the application to batch multiple read requests for different data streams and submit them to the kernel with minimal overhead. The raw packets are read from disk as fast as possible, filling the application's memory buffers.
- Intelligent Packet Routing with IOCP: As the I/O Rings complete the read operations, the dedicated CompletionThread would intelligently identify the type of data packet (video, audio, etc.). Instead of processing the data itself, it would post these completed packets as completion packets to a specialized IOCP. The completion key for each packet would serve as a routing mechanism, directing the work to the appropriate worker thread pool.
- Parallel Processing with IOCP Worker Threads: A dedicated pool of IOCP worker threads would be responsible for decoding and decompressing each type of packet. A pool for video frames would handle the CPU-intensive task of video decoding in parallel. Separate pools could handle audio decompression and closed caption data parsing. Once a packet is decoded and ready, a final routing step would send it to the correct rendering endpoint.
By scaling the decoding and rendering operations across multiple threads with IOCP, the player can ensure that decoded packets are always ready in the rendering queue, avoiding the jittering and stalling that plagues poorly written video players. This demonstrates a masterful use of both APIs to create a truly fast and scalable multimedia application.
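The key-based routing described above can be modeled in a few lines of portable C++. This is a sketch under stated assumptions: PacketKind, MediaPacket, and PacketRouter are hypothetical names, plain vectors stand in for the per-type IOCP-backed worker pools, and the packet kind doubles as the routing key in the way a completion key would.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packet kinds; each value doubles as the routing key.
enum class PacketKind : std::size_t { Video = 0, Audio, Caption, Metadata, Count };

// Hypothetical description of one packet read from the video file.
struct MediaPacket {
    PacketKind kind;
    std::uint64_t fileOffset;
    std::uint32_t length;
};

// One queue per packet kind. In the design above, each queue would be
// serviced by its own pool of worker threads.
class PacketRouter {
public:
    // Appends the packet to the queue for its kind; rejects unknown kinds.
    bool Route(const MediaPacket& packet) {
        const std::size_t index = static_cast<std::size_t>(packet.kind);
        if (index >= mQueues.size()) return false;
        mQueues[index].push_back(packet);
        return true;
    }
    // Number of packets waiting for the given kind (0 for unknown kinds).
    std::size_t Pending(PacketKind kind) const {
        const std::size_t index = static_cast<std::size_t>(kind);
        return index < mQueues.size() ? mQueues[index].size() : 0;
    }
private:
    std::array<std::vector<MediaPacket>,
               static_cast<std::size_t>(PacketKind::Count)> mQueues;
};
```

Routing two video packets and one audio packet leaves Pending(PacketKind::Video) at 2 and Pending(PacketKind::Audio) at 1, so each decoder pool only ever sees the packet type it understands.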
What are your thoughts? Do you have any comments or corrections?