Alex Tardif: Graphics Programmer

Learning D3D12 from D3D11 - Part 1: Getting Started, Command Queues, and Fencing

Purpose

I want to quickly lay out the purpose of this series before diving in. This is a guide for those who have been using D3D11, and have not yet made the move to D3D12. This is also a guide for those who have started using D3D12, but have found it intimidating, difficult to manage, or were otherwise unsure of how to build a renderer with it. Specifically, I focus on the important new concepts without diving too deep into API details, and discuss rendering systems you can build on top of D3D12 that feel familiar, while taking advantage of D3D12's new strengths. I've cribbed ideas for the architecture I detail here from a number of sources; check my resources section at the end for references to help guide you as well. I use some utility functions in here that I don't cover, but those should all be standard things you've seen before - asserts, throwing on HRESULT failures, math lib stuff, general containers, etc. Lastly, I don't make any claims that all of this is top-performance code. It's pretty good, but in order to make this digestible, I spend less time on complexity & perf and more on building an understanding so that you can build on it yourself with a strong foundation.

Getting Started

First and foremost is the project setup. DirectX now lives within the Windows 10 SDK. You'll want to download that and set up your VS project (VS 2017 recommended) to be able to use it. There are a ton of resources online for how to do that, so I won't be covering that here. Once you've done that, you're going to want to do all the core stuff to set up your window, device and swap chain. This is pretty much the same old, same old, with a few small API changes. If you want to see an example of how to do that, just grab the Hello World sample from Microsoft's GitHub and use it as a reference. You'll probably also want to look at the more recent additions to DXGI, which allow you to get more information about feature support, display information, etc. Make sure to have the Debug Layer enabled via ID3D12Debug::EnableDebugLayer, because certain things are really easy to mess up, and the debug layer is (for the most part) pretty solid at catching those mistakes. While you're doing all this work, you'll be getting very well acquainted with IID_PPV_ARGS. The only thing I'll note here, as it relates to stuff I do below, is that the BufferCount field of the swap chain desc is set to 2 for the purpose of my example, so that we can work on one frame while another is being processed and presented. After getting far enough with that, you're going to run into the first important API addition: whichever CreateSwapChain function you use, it takes a new parameter type we haven't needed before: the command queue (ID3D12CommandQueue).

The Command Queue

The D3D12 command queue is where all of your command execution and fencing will take place. Where in DX11 you would just submit your graphics commands on the D3D context, you now have to build up those commands in command lists (more on that later), and then execute those command lists in a command queue and fence where necessary. Take a look at that documentation link really quick, and then let's look at how we might wrap up this queue for ease of use.

class Direct3DQueue
{
public:
    Direct3DQueue(ID3D12Device* device, D3D12_COMMAND_LIST_TYPE commandType);
    ~Direct3DQueue();
 
    bool IsFenceComplete(uint64 fenceValue);
    void InsertWait(uint64 fenceValue);
    void InsertWaitForQueueFence(Direct3DQueue* otherQueue, uint64 fenceValue);
    void InsertWaitForQueue(Direct3DQueue* otherQueue);
 
    void WaitForFenceCPUBlocking(uint64 fenceValue);
    void WaitForIdle() { WaitForFenceCPUBlocking(mNextFenceValue - 1); }
 
    ID3D12CommandQueue* GetCommandQueue() { return mCommandQueue; }
 
    uint64 PollCurrentFenceValue();
    uint64 GetLastCompletedFence() { return mLastCompletedFenceValue; }
    uint64 GetNextFenceValue() { return mNextFenceValue; }
    ID3D12Fence* GetFence() { return mFence; }
 
    uint64 ExecuteCommandList(ID3D12CommandList* commandList);
 
private:
    ID3D12CommandQueue* mCommandQueue;
    D3D12_COMMAND_LIST_TYPE mQueueType;
 
    std::mutex mFenceMutex;
    std::mutex mEventMutex;
 
    ID3D12Fence* mFence;
    uint64 mNextFenceValue;
    uint64 mLastCompletedFenceValue;
    HANDLE mFenceEventHandle;
};

You can see it's just as I explained above - queues are all about execution and fencing. Let's talk briefly about what fencing means in the ID3D12Fence sense. We (D3D devs) no longer execute seemingly synchronously with the CPU in D3D12. In reality, D3D11 was doing a lot of async stuff behind the scenes for you, but now it's up to you to do it. That sounds daunting at first, but my hope is that, after reading this guide, you'll see it's not much work at all once you have the infrastructure. As this relates to fencing, this means that your command execution is not CPU-blocking, i.e., once you make a call to mCommandQueue->ExecuteCommandLists, the GPU is going to start doing work, but the CPU side is not going to wait for that work unless you tell it to. This gives us a lot of power, because as the graphics developer, we will know exactly when we need to wait for that work to be done, rather than the D3D layer/driver needing to guess for us. That's where the ID3D12Fence comes in - it's the object we'll notify when we do work, and it's the object we'll use to wait for the work to be done. In D3D12, a fence only has the granularity level of command list execution via ExecuteCommandLists, so you're not fencing on individual calls so much as entire blocks of work. In practice, while just getting started, the only time you'll need to interact with the fence to wait on anything is when you do uploads, or wait on entire frames to finish. Let's take a look now at the implementation:

Direct3DQueue::Direct3DQueue(ID3D12Device* device, D3D12_COMMAND_LIST_TYPE commandType)
{
    mQueueType = commandType;
    mCommandQueue = NULL;
    mFence = NULL;
    mNextFenceValue = ((uint64_t)mQueueType << 56) + 1;
    mLastCompletedFenceValue = ((uint64_t)mQueueType << 56);
 
    D3D12_COMMAND_QUEUE_DESC queueDesc = {};
    queueDesc.Type = mQueueType;
    queueDesc.NodeMask = 0;
    Direct3DUtils::ThrowIfHRESULTFailed(device->CreateCommandQueue(&queueDesc, IID_PPV_ARGS(&mCommandQueue)));
 
    Direct3DUtils::ThrowIfHRESULTFailed(device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&mFence)));
 
    mFence->Signal(mLastCompletedFenceValue);
 
    // Unnamed, auto-reset event. CreateEventEx returns NULL (not INVALID_HANDLE_VALUE) on failure.
    mFenceEventHandle = CreateEventEx(NULL, NULL, 0, EVENT_ALL_ACCESS);
    Application::Assert(mFenceEventHandle != NULL);
}

Direct3DQueue::~Direct3DQueue()
{
    CloseHandle(mFenceEventHandle);
 
    mFence->Release();
    mFence = NULL;
 
    mCommandQueue->Release();
    mCommandQueue = NULL;
}

This is how we set up our queue and fence information. First up is the D3D12_COMMAND_LIST_TYPE, which is what tells the device what type of queue we're making. There are a few, but the important ones are D3D12_COMMAND_LIST_TYPE_DIRECT (capable of graphics, compute, and copy work), D3D12_COMMAND_LIST_TYPE_COMPUTE (capable of compute and copy work) and finally D3D12_COMMAND_LIST_TYPE_COPY (only able to do copy work). When we say "copy" we're talking generally about uploads to the GPU, which we need to do manually in D3D12, and will cover later on. The style I've been taking with these queues is to create one of each. The "direct" queue is where I submit all of my immediate graphics and compute work, the "compute" queue is where I submit my async compute work, and the "copy" queue is where I issue all of my uploads. Creating the queue is straightforward; the question you probably have is what the NodeMask parameter is for. That value dictates which GPU node (card) to associate the queue with, in a multi-adapter situation. The multi-adapter API in D3D12 is actually pretty awesome, but a bit advanced for someone just trying to learn D3D12 to start with. Still, whenever you're interested, I recommend checking out this document explaining how to do it. Also, see this for more information on queues.

Next comes the fence initialization. We need to create the fence itself, and an event handle that we can use to wait on the fence when we need to. The only odd bit here is that we prime the fence with the queue type shifted by 56. I've seen this trick in a number of samples and have found it pretty handy for lazy queue lookups which we'll see a bit further down. The idea is that all we need is the fence value itself to know which queue type it came from. Specifically, when we tell our queue manager (detailed below) to check if a fence value is complete, it can know just by shifting the fence value back by 56 to get the queue type. We signal the fence with this value to set it up. Next, we have functions to let us check on the fence:

uint64 Direct3DQueue::PollCurrentFenceValue()
{
    mLastCompletedFenceValue = MathHelper::Max(mLastCompletedFenceValue, mFence->GetCompletedValue());
    return mLastCompletedFenceValue;
}
 
bool Direct3DQueue::IsFenceComplete(uint64 fenceValue)
{
    if (fenceValue > mLastCompletedFenceValue)
    {
        PollCurrentFenceValue();
    }
 
    return fenceValue <= mLastCompletedFenceValue;
}
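
If you want to convince yourself that the cached value really does save queries, here is a small platform-independent sketch. MockFence and FenceTracker are hypothetical stand-ins I made up for illustration (they are not D3D12 types); the mock simply counts how often GetCompletedValue gets called, and the tracker copies the logic of the two functions above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ID3D12Fence that counts how often it is queried.
struct MockFence
{
    uint64_t completedValue = 0; // the value the "GPU" has reached so far
    int queryCount = 0;          // number of GetCompletedValue calls made

    uint64_t GetCompletedValue()
    {
        ++queryCount;
        return completedValue;
    }
};

// Same caching logic as PollCurrentFenceValue / IsFenceComplete.
struct FenceTracker
{
    MockFence* fence = nullptr;
    uint64_t lastCompleted = 0;

    uint64_t Poll()
    {
        assert(fence != nullptr);
        lastCompleted = std::max(lastCompleted, fence->GetCompletedValue());
        return lastCompleted;
    }

    bool IsFenceComplete(uint64_t fenceValue)
    {
        if (fenceValue > lastCompleted)
        {
            Poll(); // only touch the fence when the cached value cannot answer
        }
        return fenceValue <= lastCompleted;
    }
};
```

Asking about a value at or below the cached one never touches the fence; only a value beyond the cache triggers a fresh query.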

PollCurrentFenceValue can be used to get the latest completed fence value from the fence itself, which we store off so that when we check IsFenceComplete, we can early out if we know the fence passed in has already been reached. The reason for this is that GetCompletedValue is noted as "not cheap", so we do this to prevent querying unnecessarily. Next up are some utilities to insert waits for fences into the command queue:

void Direct3DQueue::InsertWait(uint64 fenceValue)
{
    mCommandQueue->Wait(mFence, fenceValue);
}
 
void Direct3DQueue::InsertWaitForQueueFence(Direct3DQueue* otherQueue, uint64 fenceValue)
{
    mCommandQueue->Wait(otherQueue->GetFence(), fenceValue);
}
 
void Direct3DQueue::InsertWaitForQueue(Direct3DQueue* otherQueue)
{
    mCommandQueue->Wait(otherQueue->GetFence(), otherQueue->GetNextFenceValue() - 1);
}
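
A wait inserted this way stalls the queue's own timeline rather than the CPU. The following is a rough toy model of that idea, assuming nothing about real hardware scheduling: none of the Toy* types exist in D3D12, and the "GPU" here is just a loop that steps two queues in turn. The names and the particle scenario are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy model only: none of these types exist in D3D12.
struct ToyFence { uint64_t value = 0; };

struct ToyCommand
{
    enum Kind { Execute, Signal, Wait } kind;
    std::string name; // label for Execute commands
    ToyFence* fence;  // target of Signal / Wait commands
    uint64_t value;   // fence value to signal or wait for
};

struct ToyQueue
{
    std::vector<ToyCommand> commands;
    std::size_t next = 0; // index of the next command to process

    void ExecuteWork(const std::string& name) { commands.push_back({ToyCommand::Execute, name, nullptr, 0}); }
    void SignalFence(ToyFence* f, uint64_t v) { commands.push_back({ToyCommand::Signal, "", f, v}); }
    void WaitFence(ToyFence* f, uint64_t v)   { commands.push_back({ToyCommand::Wait, "", f, v}); }

    // Process one command if possible. Returns false when the queue is
    // drained or stalled on a Wait whose fence value has not been reached.
    bool Step(std::vector<std::string>& log)
    {
        if (next >= commands.size()) return false;
        const ToyCommand& cmd = commands[next];
        assert(cmd.kind == ToyCommand::Execute || cmd.fence != nullptr);
        if (cmd.kind == ToyCommand::Wait && cmd.fence->value < cmd.value) return false;
        if (cmd.kind == ToyCommand::Execute) log.push_back(cmd.name);
        if (cmd.kind == ToyCommand::Signal) cmd.fence->value = cmd.value;
        ++next;
        return true;
    }
};

// Recording returns immediately (the "CPU" never blocks); the "GPU" drains
// both queues afterwards and the log captures the order work actually ran in.
std::vector<std::string> RunParticleFrame()
{
    ToyFence computeFence;
    ToyQueue graphics, compute;

    graphics.ExecuteWork("draw scene");
    graphics.WaitFence(&computeFence, 1);  // plays the role of InsertWaitForQueueFence
    graphics.ExecuteWork("draw particles");

    compute.ExecuteWork("simulate particles");
    compute.SignalFence(&computeFence, 1); // plays the role of the queue-side Signal

    std::vector<std::string> log;
    bool progressed = true;
    while (progressed)
    {
        // Graphics is stepped first, so it hits its Wait before compute has signaled.
        const bool g = graphics.Step(log);
        const bool c = compute.Step(log);
        progressed = g || c;
    }
    return log;
}
```

In this model the graphics queue stalls at its wait until the compute queue signals, so "draw particles" always lands after "simulate particles", and nothing on the recording side ever had to block. The three Insert* functions above ask the real queue for the same behavior through mCommandQueue->Wait.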

These functions allow our queues to have the useful functionality of not only being able to wait on their own fences, but the fences of other queues as well. This is useful if, say, you wanted to ensure some work on another queue was complete (like async compute, for example) before continuing on. I have chosen my function names very carefully here to clearly distinguish between "InsertWait" and "WaitForFenceCPUBlocking". mCommandQueue->Wait will not make the CPU wait for a fence. Rather, it will make the queue itself (and thus the GPU) wait until that fence value is reached before continuing any work that comes after it. A situation that I mentioned before where this might be useful is when the direct queue is doing graphics work, but we want it to wait for the compute queue to finish an async compute pass, like GPU particle simulation for example, before it does its particle rendering work. We don't want to block the CPU to ensure this, because we'd be wasting CPU time, so we insert the wait into the queue to handle it and continue on our merry way instead. When the time comes to need to wait for a fence on the CPU, we call the following function:

void Direct3DQueue::WaitForFenceCPUBlocking(uint64 fenceValue)
{
    if (IsFenceComplete(fenceValue))
    {
        return;
    }
 
    {
        std::lock_guard<std::mutex> lockGuard(mEventMutex);
 
        mFence->SetEventOnCompletion(fenceValue, mFenceEventHandle);
        WaitForSingleObjectEx(mFenceEventHandle, INFINITE, false);
        mLastCompletedFenceValue = fenceValue;
    }
}

We start by returning early if we've already reached our fence, but if we haven't, we simply call the API and tell the CPU to wait for the fence value to be hit. It's highly unlikely the situation would arise, because there are so few places where you'll want to do this, but we do want to make sure we don't do this from multiple threads at once (only one SetEventOnCompletion with our handle), so we lock around it. Last and most importantly, the true purpose of our queues, to submit command list work:

uint64 Direct3DQueue::ExecuteCommandList(ID3D12CommandList* commandList)
{
    Direct3DUtils::ThrowIfHRESULTFailed(((ID3D12GraphicsCommandList*)commandList)->Close());
    mCommandQueue->ExecuteCommandLists(1, &commandList);
 
    std::lock_guard<std::mutex> lockGuard(mFenceMutex);
 
    mCommandQueue->Signal(mFence, mNextFenceValue);
 
    return mNextFenceValue++;
}

We'll see more about command lists later, but just know that a command list is where our rendering meat is - draw calls, dispatches, etc. We need to close the command list before executing it, then we submit it via ExecuteCommandLists. Finally, we signal with the next fence value so that we know this command list execution has finished when that fence value has been reached. We return that fence value to the calling function so that it can wait on it when needed, and then increment the fence value. We lock around the fence value usage so that multiple threads don't mess with it simultaneously.
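
To show why the lock matters, here is a minimal sketch that isolates just the fence value bookkeeping, with no D3D12 involved. FenceValueIssuer and SubmitFromThreads are hypothetical names of my own; IssueNext stands in for the signal-and-increment tail of ExecuteCommandList, and several threads hammer it at once.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical stand-in for the fence bookkeeping in ExecuteCommandList.
class FenceValueIssuer
{
public:
    explicit FenceValueIssuer(uint64_t firstValue) : mNextFenceValue(firstValue) {}

    // Mirrors the tail of ExecuteCommandList: read and post-increment under a lock,
    // so no two callers can ever be handed the same fence value.
    uint64_t IssueNext()
    {
        std::lock_guard<std::mutex> lockGuard(mFenceMutex);
        return mNextFenceValue++;
    }

private:
    std::mutex mFenceMutex;
    uint64_t mNextFenceValue;
};

// "Submit" from several threads and collect every fence value that was handed out.
std::set<uint64_t> SubmitFromThreads(int threadCount, int submitsPerThread)
{
    assert(threadCount > 0 && submitsPerThread > 0);

    FenceValueIssuer issuer(1);
    std::vector<std::vector<uint64_t>> perThread(threadCount);
    std::vector<std::thread> threads;

    for (int t = 0; t < threadCount; ++t)
    {
        threads.emplace_back([&issuer, &perThread, t, submitsPerThread]
        {
            // Each thread writes only to its own slot, so the results need no lock.
            for (int i = 0; i < submitsPerThread; ++i)
            {
                perThread[t].push_back(issuer.IssueNext());
            }
        });
    }
    for (std::thread& thread : threads) thread.join();

    std::set<uint64_t> all;
    for (const auto& values : perThread) all.insert(values.begin(), values.end());
    return all;
}
```

With the lock in place every submission gets a distinct value and the values form a gap-free range, which is what lets a caller wait on "its" fence value later without caring who else was submitting.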

Here are some interesting facts about ExecuteCommandLists. First, it's thread-safe, so multiple threads can issue calls to it simultaneously, safely (assuming order is not important). Second, if you execute Command List A, and then execute Command List B on the same queue, Command List A is guaranteed to finish before Command List B is executed. This is important to note so that you don't have to bother fencing between them. Third, if you execute Command List A and Command List B together in a single ExecuteCommandLists call (though not shown here, you can submit multiple at once), the queue will try to interleave work done by both command lists when possible. This can prove to be a good optimization if you know this will not cause issues in your submitted commands. Finally, this is also when the bulk of the command validation will run, so don't be surprised if you don't see mistakes pop up in the validation layer until the work is actually executed. I'm mentioning all of this information here because, for whatever reason, it's not called out on MSDN on the page for that function call.

Finally, let's wrap our queues up in a manager for ease of use:

class Direct3DQueueManager
{
public:
	Direct3DQueueManager(ID3D12Device* device);
	~Direct3DQueueManager();
 
	Direct3DQueue *GetGraphicsQueue() { return mGraphicsQueue; }
	Direct3DQueue *GetComputeQueue() { return mComputeQueue; }
	Direct3DQueue *GetCopyQueue() { return mCopyQueue; }
 
	Direct3DQueue *GetQueue(D3D12_COMMAND_LIST_TYPE commandType);
 
	bool IsFenceComplete(uint64 fenceValue);
	void WaitForFenceCPUBlocking(uint64 fenceValue);
	void WaitForAllIdle();
 
private:
	Direct3DQueue *mGraphicsQueue;
	Direct3DQueue *mComputeQueue;
	Direct3DQueue *mCopyQueue;
};

Now we'll have one queue of each type, with a few utility functions to fence and sync all queues when necessary. Creation/destruction/get are all simple:

Direct3DQueueManager::Direct3DQueueManager(ID3D12Device* device)
{
	mGraphicsQueue = new Direct3DQueue(device, D3D12_COMMAND_LIST_TYPE_DIRECT);
	mComputeQueue = new Direct3DQueue(device, D3D12_COMMAND_LIST_TYPE_COMPUTE);
	mCopyQueue = new Direct3DQueue(device, D3D12_COMMAND_LIST_TYPE_COPY);
}
 
Direct3DQueueManager::~Direct3DQueueManager()
{
	delete mGraphicsQueue;
	delete mComputeQueue;
	delete mCopyQueue;
}
 
Direct3DQueue *Direct3DQueueManager::GetQueue(D3D12_COMMAND_LIST_TYPE commandType)
{
	switch (commandType)
	{
	case D3D12_COMMAND_LIST_TYPE_DIRECT:
		return mGraphicsQueue;
	case D3D12_COMMAND_LIST_TYPE_COMPUTE: 
		return mComputeQueue;
	case D3D12_COMMAND_LIST_TYPE_COPY: 
		return mCopyQueue;
	default: 
		Direct3DUtils::ThrowRuntimeError("Bad command type lookup in queue manager.");
	}
 
	return NULL;
}

Next we see the advantage of that bit shifting we did earlier. Notice we don't need to bother passing in which queue type we want to check; it's known from the fence value itself. Handy!

bool Direct3DQueueManager::IsFenceComplete(uint64 fenceValue)
{
	return GetQueue((D3D12_COMMAND_LIST_TYPE)(fenceValue >> 56))->IsFenceComplete(fenceValue);
}
 
void Direct3DQueueManager::WaitForFenceCPUBlocking(uint64 fenceValue)
{
	Direct3DQueue *commandQueue = GetQueue((D3D12_COMMAND_LIST_TYPE)(fenceValue >> 56));
	commandQueue->WaitForFenceCPUBlocking(fenceValue);
}
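
Here is the tagging scheme on its own, as a small sketch that compiles without d3d12.h. QueueType is a stand-in I'm assuming for D3D12_COMMAND_LIST_TYPE (the numeric values match the real enum: DIRECT is 0, COMPUTE is 2, COPY is 3), and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for D3D12_COMMAND_LIST_TYPE so the sketch compiles without d3d12.h.
// The numeric values match the real enum (DIRECT = 0, COMPUTE = 2, COPY = 3).
enum class QueueType : uint64_t { Direct = 0, Compute = 2, Copy = 3 };

const int kQueueTypeShift = 56; // the top 8 bits of a fence value hold the queue type

// First fence value a queue of the given type hands out (matches mNextFenceValue's initial value).
inline uint64_t FirstFenceValue(QueueType type)
{
    return (static_cast<uint64_t>(type) << kQueueTypeShift) + 1;
}

// Recover the queue type from any fence value issued by that queue.
inline QueueType QueueTypeFromFence(uint64_t fenceValue)
{
    const uint64_t tag = fenceValue >> kQueueTypeShift;
    assert(tag == 0 || tag == 2 || tag == 3); // only the three queue types used here
    return static_cast<QueueType>(tag);
}

// Strip the tag to get the plain counter in the low 56 bits.
inline uint64_t SubmissionIndex(uint64_t fenceValue)
{
    return fenceValue & ((uint64_t{1} << kQueueTypeShift) - 1);
}
```

The low 56 bits leave far more room than a queue will ever use for its counter, so the tag in the top byte survives every increment and the manager can always route a fence value back to the queue that issued it.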

Lastly, a function we can use to wait and make sure we're no longer doing any GPU work. This is super useful for those rare moments when it's easier to do certain work in isolation, like when you change certain graphics options that require lots of resource creation/destruction, close down the game, etc.

void Direct3DQueueManager::WaitForAllIdle()
{
	mGraphicsQueue->WaitForIdle();
	mComputeQueue->WaitForIdle();
	mCopyQueue->WaitForIdle();
}

That's it for command queues! We'll revisit the execution a little bit when we make use of command lists, but apart from that, we've got queues that are capable of executing and waiting for commands.

Resources

The following is a list of resources from which I pull the vast majority of my information and design. It's thanks to these resources that I feel comfortable enough with DX12 to be able to share, so thank you to everyone who contributes to these!
The Microsoft DirectX Graphics Sample GitHub, especially the MiniEngine core. I borrow heavily from this.
The DirectXTech Forum "Getting Started" Section.
MJP's awesome bindless deferred rendering GitHub. His resource upload style is great.
Microsoft's Direct3D 12 Programming Guide.
The D3D12 Reference Docs.
Nvidia's DX12 Do's and Don'ts, especially useful at answering "how should I do this?" questions.
Intel's DX12 Migration Tutorials.
