Encoding indirect command buffers on the GPU

Maximize CPU-to-GPU parallelization by generating render commands on the GPU.

Overview

This sample app demonstrates how to use indirect command buffers (ICBs) to issue rendering instructions from the GPU. When you have a rendering algorithm that runs in a compute kernel, use ICBs to generate draw calls based on your algorithm’s results. The sample app uses a compute kernel to cull invisible objects submitted for rendering, and generates draw commands only for the objects currently visible in the scene.

[Image]

Without ICBs, you can’t submit rendering commands on the GPU. Instead, the CPU waits for your compute kernel’s results before generating the render commands. Then, the GPU waits for the rendering commands to cross the CPU-to-GPU bridge. The following diagram shows how this creates a slower round trip:

[Image]

The sample code project, Encoding indirect command buffers on the CPU, introduces ICBs by creating a single ICB and reusing its commands every frame. Whereas that sample saves expensive command-encoding time by reusing commands, this sample uses ICBs to effect a GPU-driven rendering pipeline.

This sample demonstrates two techniques: issuing draw calls from the GPU, and executing only a selected subset of draws.

Getting started

This project contains targets for macOS and iOS. Run the iOS scheme on a physical device because Metal isn’t supported in the simulator.

The sample calls an MTLComputeCommandEncoder instance’s dispatchThreads(_:threadsPerThreadgroup:) method, which is available on GPUs that support the following feature sets or later:

MTLFeatureSet_iOS_GPUFamily4_v2
MTLFeatureSet_macOS_GPUFamily2_v1

Define the data read by the ICB

In an ideal scenario, you store each mesh in its own buffer. However, on iOS, kernels running on the GPU can only access a limited number of data buffers per execution. To reduce the number of buffers needed during the ICB’s execution, you pack all meshes into a single buffer at varying offsets. Then, use another buffer to store the offset and size of each mesh. The steps are as follows.

At initialization, create the data for each mesh:

Count the individual and accumulated mesh sizes and create the container buffer:

Finally, insert each mesh into the container buffer while noting its offset and size in the second buffer:
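
As a rough illustration of the packing steps above, the following plain C++ sketch models the two buffers on the CPU. It doesn’t call Metal, and the Vertex, MeshRange, PackedMeshes, and packMeshes names are hypothetical, as is the vertex layout. In the app, both arrays would live in Metal buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical vertex layout, used only for this sketch.
struct Vertex {
    float x, y;
};

// One entry of the second buffer: where a mesh starts in the
// container and how many vertices it holds.
struct MeshRange {
    std::uint32_t startVertex;
    std::uint32_t vertexCount;
};

struct PackedMeshes {
    std::vector<Vertex>    container;  // all meshes, stored back to back
    std::vector<MeshRange> ranges;     // offset and size of each mesh
};

PackedMeshes packMeshes(const std::vector<std::vector<Vertex>>& meshes) {
    // First pass: accumulate the mesh sizes so the container is sized once.
    std::size_t total = 0;
    for (const auto& mesh : meshes) {
        total += mesh.size();
    }

    PackedMeshes packed;
    packed.container.reserve(total);
    packed.ranges.reserve(meshes.size());

    // Second pass: copy each mesh in and note its offset and size.
    for (const auto& mesh : meshes) {
        MeshRange range;
        range.startVertex = static_cast<std::uint32_t>(packed.container.size());
        range.vertexCount = static_cast<std::uint32_t>(mesh.size());
        packed.container.insert(packed.container.end(), mesh.begin(), mesh.end());
        packed.ranges.push_back(range);
    }

    assert(packed.container.size() == total);
    return packed;
}
```

With this layout, a draw for mesh i only needs ranges[i] to locate its vertices, so a kernel can reach every mesh through two buffer bindings regardless of how many meshes there are.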

Update the data read by the ICB dynamically

By culling non-visible vertices from the data fed to the rendering pipeline, you save significant rendering time and effort. To do that, use the same compute kernel that encodes the ICB’s commands to continually update the ICB’s data buffers:

The parallel nature of the GPU partitions the compute task for you, resulting in multiple offscreen meshes getting culled concurrently.
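
To make the per-object culling concrete, here is a CPU-side model in plain C++ under stated assumptions. It isn’t Metal Shading Language, and ObjectParams, DrawCommand, isVisible, cullAndEncode, and encodeFrame are all hypothetical names. The visibility test (a bounding circle against a square view region centered on the origin) is also an assumption chosen for brevity. The loop in encodeFrame stands in for the GPU dispatch, where each invocation would run on its own thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical per-object data; field names are assumptions.
struct ObjectParams {
    float         centerX, centerY;  // object position in view space
    float         boundingRadius;    // circle that encloses the mesh
    std::uint32_t startVertex;       // offset into the packed mesh buffer
    std::uint32_t vertexCount;       // number of vertices in the mesh
};

// Stand-in for one ICB slot. A slot that isn't encoded draws nothing.
struct DrawCommand {
    bool          encoded = false;
    std::uint32_t vertexStart = 0;
    std::uint32_t vertexCount = 0;
    std::uint32_t baseInstance = 0;
};

// Assumed, conservative test: keeps any object whose bounding circle
// might overlap a square view region of the given half extent.
bool isVisible(const ObjectParams& o, float viewHalfExtent) {
    return std::fabs(o.centerX) - o.boundingRadius < viewHalfExtent &&
           std::fabs(o.centerY) - o.boundingRadius < viewHalfExtent;
}

// Models one kernel invocation: it reads one object and writes only
// that object's slot, so invocations never contend with each other.
void cullAndEncode(std::uint32_t objectIndex,
                   const std::vector<ObjectParams>& objects,
                   float viewHalfExtent,
                   std::vector<DrawCommand>& commands) {
    assert(objectIndex < objects.size() && objects.size() == commands.size());
    const ObjectParams& o = objects[objectIndex];
    if (!isVisible(o, viewHalfExtent)) {
        return;  // culled: the slot stays empty
    }
    DrawCommand& cmd = commands[objectIndex];
    cmd.encoded = true;
    cmd.vertexStart = o.startVertex;
    cmd.vertexCount = o.vertexCount;
    cmd.baseInstance = objectIndex;
}

// Serial stand-in for dispatching one thread per object.
std::vector<DrawCommand> encodeFrame(const std::vector<ObjectParams>& objects,
                                     float viewHalfExtent) {
    std::vector<DrawCommand> commands(objects.size());  // every slot starts empty
    for (std::uint32_t i = 0; i < objects.size(); ++i) {
        cullAndEncode(i, objects, viewHalfExtent, commands);
    }
    return commands;
}
```

Because each invocation touches only its own slot, the work needs no synchronization between objects, which is what lets the GPU cull many meshes at once.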

Pass an ICB to a compute kernel using an argument buffer

To get an ICB on the GPU and make it accessible to a compute kernel, you pass it through an argument buffer, as follows:

Define the container argument buffer as a structure that contains one member, the ICB:

Encode the ICB into the argument buffer:

Pass the ICB (_indirectCommandBuffer) to the kernel by setting the argument buffer on the kernel’s compute command encoder:

Because you pass the ICB through an argument buffer, standard argument buffer rules apply. Call useResource on the compute command encoder, passing the ICB, to tell Metal to prepare the ICB for use:

Encode and optimize ICB commands

Reset the ICB’s commands to their initial state before beginning encoding:

Encode the ICB’s commands by dispatching the compute kernel:

Optimize your ICB commands to remove empty commands or redundant state by calling the blit command encoder’s optimizeIndirectCommandBuffer:withRange: method:

This sample optimizes ICB commands because redundant state results from the kernel setting a buffer for each draw, and encoding empty commands for each invisible object. By removing the empty commands, you can free up a significant number of blank spaces in the command buffer that Metal otherwise spends time skipping at runtime.
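
Metal performs this optimization internally, so the following plain C++ sketch is only a conceptual model of the two effects described above: dropping empty slots and spotting buffer bindings that repeat the previous draw’s binding. EncodedCommand, OptimizeStats, compactCommands, and the boundBuffer identifier are hypothetical and don’t correspond to any Metal API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for one ICB slot after the kernel has run.
struct EncodedCommand {
    bool          encoded;      // false for objects the kernel culled
    std::uint32_t boundBuffer;  // hypothetical ID of the buffer set for this draw
    std::uint32_t vertexStart;
    std::uint32_t vertexCount;
};

struct OptimizeStats {
    std::size_t emptyRemoved = 0;    // slots that held no draw
    std::size_t redundantBinds = 0;  // draws that re-set the buffer already bound
};

// Compacts the command list in place, preserving draw order, and
// reports how much empty or redundant work it found.
OptimizeStats compactCommands(std::vector<EncodedCommand>& commands) {
    OptimizeStats stats;
    std::size_t   write = 0;
    bool          haveBound = false;
    std::uint32_t lastBound = 0;

    for (std::size_t read = 0; read < commands.size(); ++read) {
        const EncodedCommand cmd = commands[read];
        if (!cmd.encoded) {
            ++stats.emptyRemoved;  // nothing to execute in this slot
            continue;
        }
        if (haveBound && cmd.boundBuffer == lastBound) {
            ++stats.redundantBinds;  // same buffer as the previous draw
        }
        haveBound = true;
        lastBound = cmd.boundBuffer;
        commands[write++] = cmd;
    }

    assert(write + stats.emptyRemoved == commands.size());
    commands.resize(write);
    return stats;
}
```

In a scene where most objects are offscreen, the compacted list in this model is much shorter than the original, which mirrors the runtime saving from not skipping over blank slots.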

Execute the ICB

Draw the onscreen meshes by calling executeCommandsInBuffer on your render command encoder:

While you can encode an ICB’s commands in a compute kernel, you call executeCommandsInBuffer from your host app to encode a single command that contains all of the commands encoded by the compute kernel. By doing this, you choose the queue and buffer that the ICB’s commands go into. The point at which you call executeCommandsInBuffer determines where the ICB’s commands fall among any other commands you encode in the same buffer.