Contents

Vision

Analyze image and video content in your app using computer vision algorithms for object detection, text recognition, and image segmentation.

Overview

The Vision framework provides pretrained machine learning models for computer vision tasks. Use Vision to analyze still images and video for a variety of purposes, including:

  • Recognizing text in 26 languages across everyday objects, documents, and photos

  • Detecting barcodes and QR codes

  • Detecting faces and analyzing facial features

  • Isolating people and foreground objects with subject lifting

  • Tracking body poses of people and animals for action and gesture recognition

  • Classifying images for categorization and search

  • Measuring image quality and comparing visual similarity

All Vision analysis tasks follow the same steps: create a request, perform it on an image or video frame, and read the resulting observations. For example, to detect text in an image, you create a request for the type of analysis you want to perform. Each request conforms to the VisionRequest protocol.

import Vision

// imageData contains the image to analyze, loaded elsewhere in your app.
let request = RecognizeTextRequest()
let observations = try await request.perform(on: imageData)

// Store observations for use in your app
var scannedText: [String] = []

for observation in observations {
    scannedText.append(observation.transcript)
}

The request returns an array of observation objects that contain the image-analysis results. Each observation type provides specific details about the analysis results, such as recognized text, confidence scores, and bounding box locations.
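The same pattern applies to other observation types. As a sketch, a face-detection request returns observations that expose both a confidence score and a normalized bounding box (here, `imageData` is a placeholder for image bytes you've already loaded):

```swift
import Vision

// A minimal sketch: detect faces and read per-observation details.
// `imageData` stands in for previously loaded image data.
let faceRequest = DetectFaceRectanglesRequest()
let faces = try await faceRequest.perform(on: imageData)

for face in faces {
    // Every observation carries a confidence score between 0.0 and 1.0.
    print("Confidence:", face.confidence)
    // Location-based observations expose a normalized bounding box.
    print("Bounding box:", face.boundingBox)
}
```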

For observations that describe image locations, such as face bounding boxes or text regions, Vision uses a normalized coordinate system where values range from 0.0 to 1.0, with the origin at the lower-left corner. For more information on coordinate types and conversion helpers, see Image locations and regions.

You can also perform multiple requests on the same image. For more information, see ImageRequestHandler in the Request handlers section.
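As a sketch, a single ImageRequestHandler lets you run several requests against one image without re-decoding it for each request (again treating `imageData` as previously loaded image bytes):

```swift
import Vision

// One handler wraps the image; each request runs against it in turn.
let handler = ImageRequestHandler(imageData)

let textRequest = RecognizeTextRequest()
let barcodeRequest = DetectBarcodesRequest()

// Each perform call returns the observation type specific to its request.
let textObservations = try await handler.perform(textRequest)
let barcodeObservations = try await handler.perform(barcodeRequest)
```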

This pattern applies to all Vision requests, whether you’re detecting faces, tracking motion, analyzing image quality, or performing custom analysis with Core ML models. Each request type returns observations specific to its analysis task.
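Custom Core ML models slot into the same request/observation flow. A sketch, assuming a hypothetical Xcode-generated model class named `MyClassifier` (not part of Vision) wrapped in a Core ML container:

```swift
import Vision
import CoreML

// MyClassifier is a placeholder for an Xcode-generated Core ML model class.
let model = try MyClassifier(configuration: MLModelConfiguration()).model
let container = try CoreMLModelContainer(model: model)

// The Core ML request behaves like any other Vision request.
let request = CoreMLRequest(model: container)
let results = try await request.perform(on: imageData)
```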

Topics

Text and document analysis

Facial analysis

Image segmentation and subject lifting

Pose analysis

Image classification and recognition

Shape and edge detection

Image quality and saliency analysis

Motion and object tracking

Image registration and comparison

Custom Core ML integration

Protocols

Request handlers

Image locations and regions

Errors

Legacy API