Contents

Adding a display mask rectangle metadata track to a movie file

Show a specific area of a video by using timed display mask rectangle metadata.

Overview

A QuickTime movie can provide display mask rectangle metadata that indicates the area of a video to show during playback. On visionOS 26 and later, the system reads this metadata, crops the video to the specified display area, and renders the area outside the display mask as transparent. You can use a display mask to remove the encoded black letterbox or pillarbox at rendering time, or to dynamically change the visible portion of a video for creative effects.

This sample is a command-line app that demonstrates how to create a QuickTime movie file with display mask rectangle metadata. It adds a timed metadata track to store the display mask rectangle metadata, and adds a render track reference from the metadata track to the video track to signal the association to the video playback system.

The screen recordings below show how the two types of display mask rectangle metadata that the sample app adds affect the displayed video, compared with the original movie file.

Configure the sample code project

The sample app takes three arguments; the third is optional:

./AVAddDisplayMaskTrack <input-path> <output-path> <display-mask-type>
<input-path>

The path to the existing source QuickTime movie file with a video track.

<output-path>

The path to the new output QuickTime movie file, which includes the source movie file’s media tracks with the additional display mask rectangle timed metadata track.

<display-mask-type>

An integer that selects the display mask type. The valid values are 1 and 2; the app defaults to 1 if you omit the third argument or provide an invalid value.

The display-mask-type argument indicates the display mask to write to the movie file:

  • Type 1 display mask is a static square that’s 75 percent of the shorter side of the video’s dimensions, and centered on the video frame for the entire duration of the movie. For example, if the video’s dimensions are 1920 x 1080, then the display mask is 810 x 810 (1080 * 0.75 = 810), and centered at (960, 540).

  • Type 2 display mask is a per-frame square mask that’s 30 percent of the shorter side of the video’s dimensions, and moves across the video frame. This type illustrates dynamic display mask metadata that updates at the associated video track’s frame rate.
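The Type 1 geometry reduces to simple arithmetic on the video dimensions. The following sketch computes the square's side and placement; the function name and return tuple are illustrative, not part of the sample project:

```swift
// Illustrative helper: compute the Type 1 static mask for a video size.
// Integer arithmetic avoids floating-point rounding for the percentage.
func type1StaticMask(videoWidth: Int, videoHeight: Int) -> (side: Int, left: Int, top: Int) {
    // The square is 75 percent of the shorter side of the video's dimensions.
    let side = min(videoWidth, videoHeight) * 75 / 100
    // Center the square on the video frame.
    let left = (videoWidth - side) / 2
    let top = (videoHeight - side) / 2
    return (side, left, top)
}

// For a 1920 x 1080 video: an 810 x 810 square with its origin at (555, 135),
// which centers it at (960, 540).
let mask = type1StaticMask(videoWidth: 1920, videoHeight: 1080)
print(mask.side, mask.left, mask.top)  // 810 555 135
```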

Set up a display mask rectangle metadata track

The sample provides a MovieProcessor class that contains the app’s metadata processing logic. When you run the app, it passes the command-line arguments you specify to the MovieProcessor class’s processMovie(inputPath:outputPath:displayMaskType:) method. This method sets up the reading and writing functionality using AVAssetReader and AVAssetWriter, respectively.

To allow the app to append metadata during asset writing, this method calls addDisplayMaskMetadataTrack(to:videoInput:videoInfo:) to create an AVAssetWriterInput that writes a timed metadata track for the display mask rectangle. Before creating the writer input, this method creates a CMMetadataFormatDescription for the display mask metadata. The format description uses the boxed metadata (mebx) type and pairs the kCMMetadataIdentifier_QuickTimeMetadataDisplayMaskRectangleMono identifier with the kCMMetadataBaseDataType_RasterRectangleValue data type:

// Define the metadata specifications for the monoscopic display mask rectangle.
let metadataSpecifications: [[String: Any]] = [[
    kCMMetadataFormatDescriptionMetadataSpecificationKey_Identifier as String:
        kCMMetadataIdentifier_QuickTimeMetadataDisplayMaskRectangleMono as String,
    kCMMetadataFormatDescriptionMetadataSpecificationKey_DataType as String:
        kCMMetadataBaseDataType_RasterRectangleValue as String
]]

// Create the `CMMetadataFormatDescription` for the monoscopic display mask rectangle
// in boxed metadata (`mebx`) type.
var metadataFormatDesc: CMMetadataFormatDescription? = nil
let status = CMMetadataFormatDescriptionCreateWithMetadataSpecifications(
    allocator: kCFAllocatorDefault,
    metadataType: kCMMetadataFormatType_Boxed,
    metadataSpecifications: metadataSpecifications as CFArray,
    formatDescriptionOut: &metadataFormatDesc
)

With the format description in place, the method creates an AVAssetWriterInput for the metadata track. It sets expectsMediaDataInRealTime to false because the app writes metadata samples as fast as it can process them, rather than receiving them from a live capture source. It also sets the mediaTimeScale to match the video track to align the metadata sample timestamps precisely with the video frames:

// Create the `AVAssetWriterInput` for the display mask metadata track and attach it to `AVAssetWriter`.
metadataInput = AVAssetWriterInput(mediaType: .metadata, outputSettings: nil, sourceFormatHint: metadataFormatDesc)
guard let metadataInput else {
    throw ProcessingError.writerInputCreationFailed("DisplayMask metadata.")
}

metadataInput.expectsMediaDataInRealTime = false
metadataInput.mediaTimeScale = videoInfo.timescale

The method then creates an AVAssetWriterInputMetadataAdaptor to append timed metadata groups to the writer input. The adaptor provides a convenient way to write AVTimedMetadataGroup objects, which package metadata items with their time ranges, rather than working directly with sample buffers:

// Create the metadata adaptor for the display mask metadata's `AVAssetWriterInput`.
metadataAdaptor = AVAssetWriterInputMetadataAdaptor(assetWriterInput: metadataInput)

Finally, the method adds the metadata input to the asset writer and establishes a track association between the metadata track and the video track. The render metadata source association (rndr) is required so the playback system knows which video track the display mask metadata applies to. The playback system ignores this metadata when this association doesn’t exist.

if writer.canAdd(metadataInput) {
    writer.add(metadataInput)

    // Add the `rndr` track association between the display mask metadata track and
    // the enabled video track.
    metadataInput.addTrackAssociation(withTrackOf: videoInput, type: AVAssetTrack.AssociationType.renderMetadataSource.rawValue)
} else {
    throw ProcessingError.cannotAddWriterInput("DisplayMask metadata.")
}

Write display mask rectangle metadata

The setupDisplayMaskMetadataTransfer(videoInfo:maskType:) method handles writing the display mask metadata samples to the timed metadata track. This method uses the video track’s dimensions to calculate the raster rectangle parameters and writes the metadata samples using the adaptor created earlier.

Depending on the display mask type you specify at the command line, the method takes one of two paths:

  • Type 1 creates a static display mask: a single centered square that remains fixed for the video’s duration.

  • Type 2 creates a dynamic display mask: a smaller square that moves across the frame, with a new metadata sample for each video frame.

To see the specific calculations each path uses, open the MovieProcessor.swift file in the sample project and look for the “Type 1 static display mask calculation” and “Type 2 dynamic display mask initialization/update calculation” marks.
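As a sketch of how a per-frame Type 2 position might update, assume a square that’s 30 percent of the shorter side sweeping horizontally across the frame. The sweep below is an assumed motion path for illustration only; the sample’s actual update calculation is in MovieProcessor.swift:

```swift
// Illustrative per-frame mask position; the horizontal sweep is an assumed
// motion path, not necessarily the one the sample project uses.
func type2MaskLeft(frameIndex: Int, frameCount: Int,
                   videoWidth: Int, videoHeight: Int) -> (side: Int, left: Int) {
    // The square is 30 percent of the shorter side of the video's dimensions.
    let side = min(videoWidth, videoHeight) * 30 / 100
    // Interpolate the square's left edge from 0 to (videoWidth - side).
    let travel = videoWidth - side
    let left = travel * frameIndex / max(frameCount - 1, 1)
    return (side, left)
}

// For a 1920 x 1080 video with 240 frames: a 324 x 324 square that starts
// at x = 0 and ends at x = 1596.
print(type2MaskLeft(frameIndex: 0, frameCount: 240, videoWidth: 1920, videoHeight: 1080).left)    // 0
print(type2MaskLeft(frameIndex: 239, frameCount: 240, videoWidth: 1920, videoHeight: 1080).left)  // 1596
```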

Despite writing different metadata, both paths follow a similar pattern. They request data from the metadata input when it’s ready, create a timed metadata group for the display mask rectangle, and append it to the metadata adaptor:

metadataInput.requestMediaDataWhenReady(on: queue) {
    while metadataInput.isReadyForMoreMediaData {
        // Calculate the raster rectangle parameters for this media sample.
        let rasterRectangle = …

        // Create the timed metadata group for the display mask rectangle.
        let metadataGroup = self.createMetadataGroupForDisplayMask(
            rasterRectangle: rasterRectangle,
            sampleTime: sampleTime,
            sampleDuration: sampleDuration
        )
        // Append the metadata group.
        metadataAdaptor.append(metadataGroup)
    }
    // When no metadata samples remain, call `metadataInput.markAsFinished()`
    // so the asset writer can finish writing the track.
}

The createMetadataGroupForDisplayMask(rasterRectangle:sampleTime:sampleDuration:) method creates the timed metadata group:

private func createMetadataGroupForDisplayMask(rasterRectangle: [Int],
                sampleTime: CMTime, sampleDuration: CMTime) -> AVTimedMetadataGroup {
    let metadataItem = AVMutableMetadataItem()
    metadataItem.identifier = AVMetadataIdentifier(
        kCMMetadataIdentifier_QuickTimeMetadataDisplayMaskRectangleMono as String)
    metadataItem.value = rasterRectangle as NSArray
    metadataItem.dataType = kCMMetadataBaseDataType_RasterRectangleValue as String

    // Wrap the metadata item in `AVTimedMetadataGroup`.
    let timedMetadataGroup = AVTimedMetadataGroup(
        items: [metadataItem],
        timeRange: CMTimeRange(start: sampleTime, duration: sampleDuration)
    )

    return timedMetadataGroup
}

This method creates an AVMutableMetadataItem with the display mask identifier and data type, sets its value to the raster rectangle array, and wraps it in an AVTimedMetadataGroup with the specified time range. The rasterRectangle parameter is an array of six integers: [rasterWidth, rasterHeight, left, width, top, height]. The time range determines when this metadata applies during playback:

  • For a static mask, the time range spans the entire video duration.

  • For a dynamic mask, the time range matches the duration of a single video frame.
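As a worked example of the six-integer layout, the Type 1 mask for a 1920 x 1080 video packs into the array like this (a sketch assuming that layout; the helper name is illustrative):

```swift
// Build the six-integer raster rectangle value:
// [rasterWidth, rasterHeight, left, width, top, height].
func makeRasterRectangle(rasterWidth: Int, rasterHeight: Int,
                         left: Int, top: Int, width: Int, height: Int) -> [Int] {
    [rasterWidth, rasterHeight, left, width, top, height]
}

// An 810 x 810 centered square on a 1920 x 1080 raster:
// left = (1920 - 810) / 2 = 555, top = (1080 - 810) / 2 = 135.
let rect = makeRasterRectangle(rasterWidth: 1920, rasterHeight: 1080,
                               left: 555, top: 135, width: 810, height: 810)
print(rect)  // [1920, 1080, 555, 810, 135, 810]
```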

When the asset writer finishes, the output file contains the original video with the display mask rectangle metadata track. On visionOS 26 or later, you can play the file in apps like Files to see the display mask effect.

See Also

Media writing