Vision
Analyze image and video content in your app using computer vision algorithms for object detection, text recognition, and image segmentation.
Overview
The Vision framework provides pretrained machine learning models for computer vision tasks. Use Vision to analyze still images and video for a variety of purposes, including:
Recognizing text in 26 languages across everyday objects, documents, and photos
Detecting barcodes and QR codes
Detecting faces and analyzing facial features
Isolating people and foreground objects with subject lifting
Tracking body poses of people and animals for action and gesture recognition
Classifying images for categorization and search
Measuring image quality and comparing visual similarity
All Vision analysis tasks follow the same steps: create a request, perform it on an image or video frame, and read the resulting observations. For example, to detect text in an image, create a RecognizeTextRequest. Each request type conforms to the VisionRequest protocol.
let request = RecognizeTextRequest()
let observations = try await request.perform(on: imageData)

// Store observations for use in your app.
var scannedText: [String] = []
for observation in observations {
    scannedText.append(observation.transcript)
}

The request returns an array of observation objects that contain the image-analysis results. Each observation type provides details specific to its analysis, such as recognized text, confidence scores, and bounding-box locations.
For observations that describe image locations, such as face bounding boxes or text regions, Vision uses a normalized coordinate system where values range from 0.0 to 1.0, with the origin at the lower-left corner. For more information on coordinate types and conversion helpers, see Image locations and regions.
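For example, to draw an observation's bounding box over an image, convert it from normalized space to pixel space. Here's a minimal sketch, assuming you've already obtained the box as a CGRect (for instance, through a NormalizedRect conversion) and your drawing code expects an upper-left origin:

import CoreGraphics

// Converts a normalized, lower-left-origin rectangle into pixel
// coordinates with an upper-left origin.
func pixelRect(fromNormalized box: CGRect, imageSize: CGSize) -> CGRect {
    CGRect(
        x: box.minX * imageSize.width,
        // Flip the y-axis: Vision places the origin at the lower-left corner.
        y: (1.0 - box.maxY) * imageSize.height,
        width: box.width * imageSize.width,
        height: box.height * imageSize.height
    )
}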
You can also perform multiple requests on the same image; for more information, see ImageRequestHandler in the Request handlers section.
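As a sketch of that approach, the following creates one handler from image data and performs two requests on it in turn; the request types shown stand in for any combination you need:

// Create one handler for the image, then run multiple requests
// against it without reloading the image each time.
let handler = ImageRequestHandler(imageData)
let faces = try await handler.perform(DetectFaceRectanglesRequest())
let text = try await handler.perform(RecognizeTextRequest())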
This pattern applies to all Vision requests, whether you’re detecting faces, tracking motion, analyzing image quality, or performing custom analysis with Core ML models. Each request type returns observations specific to its analysis task.
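For instance, image classification follows the same create-perform-read steps; in this sketch, the 0.8 confidence threshold is an arbitrary value for illustration:

// Classify the image, then keep only the confident labels.
let request = ClassifyImageRequest()
let observations = try await request.perform(on: imageData)
let labels = observations
    .filter { $0.confidence > 0.8 }
    .map(\.identifier)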
Topics
Text and document analysis
Locating and displaying recognized text
Recognizing tables within a document
DetectBarcodesRequest
DetectDocumentSegmentationRequest
DetectTextRectanglesRequest
RecognizeDocumentsRequest
RecognizeTextRequest
Facial analysis
Analyzing a selfie and visualizing its content
DetectFaceCaptureQualityRequest
DetectFaceLandmarksRequest
DetectFaceRectanglesRequest
Image segmentation and subject lifting
GenerateForegroundInstanceMaskRequest
GeneratePersonInstanceMaskRequest
GeneratePersonSegmentationRequest
Pose analysis
DetectAnimalBodyPoseRequest
DetectHumanBodyPose3DRequest
DetectHumanBodyPoseRequest
DetectHumanHandPoseRequest
Supporting Pose Types
Image classification and recognition
Classifying images for categorization and search
ClassifyImageRequest
DetectHumanRectanglesRequest
RecognizeAnimalsRequest
Shape and edge detection
Image quality and saliency analysis
Generating high-quality thumbnails from videos
CalculateImageAestheticsScoresRequest
DetectLensSmudgeRequest
GenerateAttentionBasedSaliencyImageRequest
GenerateObjectnessBasedSaliencyImageRequest
Motion and object tracking
Image registration and comparison
GenerateImageFeaturePrintRequest
TrackHomographicImageRegistrationRequest
TrackTranslationalImageRegistrationRequest
Custom Core ML integration
Protocols
Request handlers
Image locations and regions
NormalizedPoint
NormalizedRect
NormalizedRegion
NormalizedCircle
BoundingBoxProviding
BoundingRegionProviding
QuadrilateralProviding
CoordinateOrigin