Synchronizing CPU and GPU work
Avoid stalls between CPU and GPU work by using multiple instances of a resource.
Overview
In this sample code project, you learn how to manage data dependencies and avoid processor stalls between the CPU and the GPU.
The project continuously renders triangles along a sine wave. In each frame, the sample updates the position of each triangle’s vertices and then renders a new image. These dynamic data updates create an illusion of motion, where the triangles appear to move along the sine wave.
[Image]
The sample stores the triangle vertices in a buffer that’s shared between the CPU and the GPU. The CPU writes data to the buffer and the GPU reads it.
Understand the solution to data dependencies and processor stalls
Resource sharing creates a data dependency between the processors; the CPU needs to finish writing to the resource before the GPU reads it. If the GPU reads the resource before the CPU writes to it, the GPU reads undefined resource data. If the GPU reads the resource while the CPU is writing to it, the GPU reads incorrect resource data.
[Image]
These data dependencies create processor stalls between the CPU and the GPU; each processor needs to wait for the other to finish its work before beginning its own work.
However, because the CPU and GPU are separate processors, you can make them work simultaneously by using multiple instances of a resource. Each frame, you need to provide the same arguments to your shaders, but this doesn’t mean you need to reference the same resource object. Instead, you create a pool of multiple instances of a resource and use a different one each time you render a frame. For example, as shown below, the CPU can write position data to a buffer used for frame n+1, at the same time that the GPU reads position data from a buffer used for frame n. By using multiple instances of a buffer, the CPU and the GPU can work continuously and avoid stalls as long as you keep rendering frames.
[Image]
Initialize data with the CPU
Define a custom AAPLVertex structure that represents a vertex. Each vertex has a position and a color:
Define a custom AAPLTriangle class that provides an interface to a default triangle, which is made up of 3 vertices:
Initialize multiple triangle vertices with a position and a color, and store them in an array of triangles, _triangles:
Allocate data storage
Calculate the total storage size of your triangle vertices. Your app renders 50 triangles; each triangle has 3 vertices, totaling 150 vertices, and each vertex is the size of AAPLVertex:
Initialize multiple buffers to store multiple copies of your vertex data. For each buffer, allocate exactly enough memory to store 150 vertices:
Upon initialization, the contents of the buffer instances in the _vertexBuffers array are empty.
Update data with the CPU
In each frame, at the start of the draw(in:) render loop, use the CPU to update the contents of one buffer instance in the updateState method:
After you update a buffer instance, you don’t access its data with the CPU for the rest of the frame.
Encode GPU commands
Next, you encode commands that reference the buffer instance in a render pass:
Commit and execute GPU commands
At the end of the render loop, call your command buffer’s commit() method to submit your work to the GPU:
The GPU begins its work and reads from the vertices buffer in the vertex shader, which takes the buffer instance as an input argument and returns a RasterizerData value for each vertex:
Reuse multiple buffer instances in your app
For each frame, perform the following steps, as described above. A full frame’s work is finished when both processors have completed their work.
1. Write data to a buffer instance.
2. Encode commands that reference the buffer instance.
3. Commit a command buffer that contains the encoded commands.
4. Read data from the buffer instance.
When a frame’s work is finished, the CPU and the GPU no longer need the buffer instance used in that frame. However, discarding a used buffer instance and creating a new one for each frame is expensive and wasteful. Instead, as shown below, set up your app to cycle through a first in, first out (FIFO) queue of buffer instances, _vertexBuffers, that you can reuse. The maximum number of buffer instances in the queue is defined by the value of MaxFramesInFlight, set to 3:
In each frame, at the start of the render loop, you update the next buffer instance from the _vertexBuffers queue. You cycle through the queue sequentially and update only one buffer instance per frame; at the end of every third frame, you return to the start of the queue:
Manage the rate of CPU and GPU work
When you have multiple instances of a buffer, you can make the CPU start work for frame n+1 with one instance, while the GPU finishes work for frame n with another instance. This implementation improves your app’s efficiency by making the CPU and the GPU work simultaneously. However, you need to manage your app’s rate of work so you don’t exceed the number of buffer instances available.
To manage your app’s rate of work, use a semaphore to wait for full frame completions in case the CPU is working much faster than the GPU. A semaphore is a non-Metal object that you use to control access to a resource that’s shared across multiple processors (or threads). The semaphore has an associated counting value, which you decrement or increment, that indicates whether a processor has started or finished accessing a resource. In your app, a semaphore controls CPU and GPU access to buffer instances.
You initialize the semaphore with a counting value of MaxFramesInFlight, to match the number of buffer instances. This value indicates that your app can simultaneously work on a maximum of 3 frames at any given time:
At the start of the render loop, you decrement the semaphore’s counting value by 1. This indicates that you’re ready to work on a new frame. However, if the counting value falls below 0, the semaphore makes the CPU wait until you increment the value:
At the end of the render loop, you register a command buffer completion handler. When the GPU completes the command buffer’s execution, it calls this completion handler and you increment the semaphore’s counting value by 1. This indicates that you’ve completed all work for a given frame and you can reuse the buffer instance used in that frame:
The addCompletedHandler(_:) method registers a block of code that’s called immediately after the GPU has finished executing the associated command buffer. Because you use only one command buffer per frame, receiving the completion callback indicates that the GPU has completed the frame.
Set the mutability of your buffers
Your app performs all per-frame rendering setup on a single thread. First it writes data to a buffer instance with the CPU. After that, it encodes rendering commands that reference the buffer instance. Finally, it commits a command buffer for GPU execution. Because these tasks always happen in this order on a single thread, the app guarantees that it finishes writing data to a buffer instance before it encodes a command that references the buffer instance.
This order allows you to mark your buffer instances as immutable. When you configure your render pipeline descriptor, set the mutability property of the vertex buffer at the buffer instance index to MTLMutability.immutable:
Metal can optimize the performance of immutable buffers, but not mutable buffers. For best results, use immutable buffers as much as possible.