jkrukowski/swift-embeddings
Run embedding models locally in Swift using MLTensor.
Supported Model Architectures
BERT (Bidirectional Encoder Representations from Transformers)
Some of the supported models on Hugging Face:
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/msmarco-bert-base-dot-v5
- sentence-transformers/LaBSE
- thenlper/gte-base
- google-bert/bert-base-uncased
NOTE: google-bert/bert-base-uncased is supported but weightKeyTransform must be provided in the LoadConfig:
let modelBundle = try await Bert.loadModelBundle(
    from: "google-bert/bert-base-uncased",
    loadConfig: .googleBert
)

ModernBERT
Some of the supported models on Hugging Face:
NOTE: answerdotai/ModernBERT-base is supported, but its weights must be prefixed with `model.`:
let modelBundle = try await ModernBert.loadModelBundle(
    from: "answerdotai/ModernBERT-base",
    loadConfig: .addWeightKeyPrefix("model.")
)

NomicBERT (Nomic Embed)
Some of the supported models on Hugging Face:
RoBERTa (Robustly Optimized BERT Approach)
Some of the supported models on Hugging Face:
NOTE: Weights in FacebookAI/roberta-base must be prefixed with `roberta.`; provide the prefix in the LoadConfig:
let modelBundle = try await Roberta.loadModelBundle(
    from: "FacebookAI/roberta-base",
    loadConfig: .addWeightKeyPrefix("roberta.")
)

XLM-RoBERTa (Cross-lingual Language Model - Robustly Optimized BERT Approach)
Some of the supported models on Hugging Face:
- FacebookAI/xlm-roberta-base
- intfloat/multilingual-e5-small
- sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- tomaarsen/xlm-roberta-base-multilingual-en-ar-fr-de-es-tr-it
NOTE: Weights in FacebookAI/xlm-roberta-base must be prefixed with `roberta.`; provide the prefix in the LoadConfig:
let modelBundle = try await XLMRoberta.loadModelBundle(
    from: "FacebookAI/xlm-roberta-base",
    loadConfig: .addWeightKeyPrefix("roberta.")
)

Qwen3 (Qwen3 Embedding)
Decoder-only embedding models that use last-token pooling. By default encode / batchEncode apply .lastTokenPool(normalize: true). Some of the supported models on Hugging Face:
let modelBundle = try await Qwen3.loadModelBundle(
    from: "jkrukowski/Qwen3-Embedding-0.6B-F32"
)
let encoded = try modelBundle.encode("The cat is black")

NOTE: The original Qwen3 checkpoints (e.g. Qwen/Qwen3-Embedding-0.6B) are stored in BF16, which MLTensor cannot represent. Use a Float32 conversion of the weights.
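To make the Float32 conversion concrete: a BF16 value has the same layout as the upper 16 bits of an IEEE 754 Float32, so widening it loses no information. The sketch below only illustrates that bit-level relationship. `bf16ToFloat32` is a hypothetical helper, not part of swift-embeddings, and converting a real checkpoint would normally be done with separate tooling.

```swift
// Hypothetical helper, not part of swift-embeddings: widens one BF16 value
// (given as its raw 16-bit pattern) to Float32.
func bf16ToFloat32(_ bits: UInt16) -> Float {
    // BF16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa
    // bits of a Float32, so shifting left by 16 restores a valid Float32.
    return Float(bitPattern: UInt32(bits) << 16)
}

let one = bf16ToFloat32(0x3F80)       // 0x3F80 is the BF16 pattern of 1.0
let minusTwo = bf16ToFloat32(0xC000)  // 0xC000 is the BF16 pattern of -2.0
print(one, minusTwo)
```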
CLIP (Contrastive Language–Image Pre-training)
NOTE: Only text encoding is supported for now. Some of the supported models on Hugging Face:
Word2Vec
NOTE: This is a word embedding model. It loads and keeps the whole model in memory. For a more memory-efficient solution, you might want to use SQLiteVec. Some of the supported models on Hugging Face:
- jkrukowski/glove-twitter-25
- jkrukowski/glove-twitter-50
- jkrukowski/glove-twitter-100
- jkrukowski/glove-twitter-200
Model2Vec
More info here.
Some of the supported models on Hugging Face:
- minishlab/potion-base-2M
- minishlab/potion-base-4M
- minishlab/potion-base-8M
- minishlab/potion-retrieval-32M
- minishlab/potion-base-32M
- minishlab/M2V_base_output
Static Embeddings
More info here.
Some of the supported models on Hugging Face:
Installation
Add the following to your Package.swift file. In the package dependencies add:
dependencies: [
    .package(url: "https://github.com/jkrukowski/swift-embeddings", from: "0.0.16")
]

In the target dependencies add:
dependencies: [
    .product(name: "Embeddings", package: "swift-embeddings")
]

Usage
Encoding
import Embeddings

// load model and tokenizer from Hugging Face
let modelBundle = try await Bert.loadModelBundle(
    from: "sentence-transformers/all-MiniLM-L6-v2"
)

// encode text
let encoded = try modelBundle.encode("The cat is black")
let result = await encoded.cast(to: Float.self).shapedArray(of: Float.self).scalars

// print result
print(result)

Batch Encoding
import Embeddings
import MLTensorUtils

let texts = [
    "The cat is black",
    "The dog is black",
    "The cat sleeps well"
]
let modelBundle = try await Bert.loadModelBundle(
    from: "sentence-transformers/all-MiniLM-L6-v2"
)
let encoded = try modelBundle.batchEncode(texts)
let distance = cosineDistance(encoded, encoded)
let result = await distance.cast(to: Float.self).shapedArray(of: Float.self).scalars
print(result)

Encode Options
The transformer model bundles (Bert, ModernBert, NomicBert, Qwen3, Roberta, XLMRoberta, Clip) share a single EncodeOptions struct on encode / batchEncode:
- maxLength: maximum number of tokens per input sequence.
- padTokenId: padding token id used by batchEncode (nil uses the model's own id).
- postProcess: how the token-level output is pooled (see below).
- computePolicy: the MLComputePolicy for the underlying MLTensor work.
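As a rough illustration of what maxLength and padTokenId control in a batch, the sketch below truncates each token sequence and pads the shorter ones so the batch is rectangular. It is not the library's implementation: `padBatch` is a hypothetical helper, and right-side padding and the attention-mask layout are assumptions made for the example.

```swift
// Hypothetical helper, not part of swift-embeddings: truncates each token
// sequence to maxLength, then right-pads to the longest remaining sequence.
func padBatch(
    _ batch: [[Int32]],
    maxLength: Int,
    padTokenId: Int32
) -> (tokens: [[Int32]], attentionMask: [[Int32]]) {
    let truncated = batch.map { Array($0.prefix(maxLength)) }
    let width = truncated.map { $0.count }.max() ?? 0
    var tokens: [[Int32]] = []
    var attentionMask: [[Int32]] = []
    for sequence in truncated {
        let padCount = width - sequence.count
        let padding = [Int32](repeating: padTokenId, count: padCount)
        // 1 marks a real token, 0 marks padding.
        let ones = [Int32](repeating: 1, count: sequence.count)
        let zeros = [Int32](repeating: 0, count: padCount)
        tokens.append(sequence + padding)
        attentionMask.append(ones + zeros)
    }
    return (tokens, attentionMask)
}

// Made-up token ids; the first sequence is longer than maxLength.
let padded = padBatch([[101, 7, 8, 9, 102], [101, 7, 102]], maxLength: 4, padTokenId: 0)
print(padded.tokens)
print(padded.attentionMask)
```

The attention mask is what lets a pooling step such as mean pooling ignore the padded positions.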
When you don't pass options, each bundle uses its defaultEncodeOptions. These are derived at load time from the model's sentence-transformers configuration (modules.json, Pooling/config.json, the sentence_config.json transformer config, and the tokenizer's model_max_length) when present — so a sentence-transformers model is pooled correctly out of the box (e.g. nomic-ai/nomic-embed-text-v1.5 → .meanPool(normalize: false), nomic-ai/modernbert-embed-base → .meanPool(normalize: true)). For plain checkpoints with no such configuration, the bundle falls back to a per-model static default (.bert, .modernBert, .nomicBert, .qwen3, .roberta, .xlmRoberta, .clip). Clip has no sentence-transformers layout, so it always uses .clip.
To override, start from a static default (or bundle.defaultEncodeOptions) and set the fields you need:
var options = EncodeOptions.bert
options.postProcess = .meanPool(normalize: true)
options.computePolicy = .cpuOnly
let encoded = try modelBundle.encode("The cat is black", options: options)

Pooling (postProcess)
- .clsTokenPool(normalize:): use the first ([CLS]) token.
- .meanPool(normalize:): average the (attention-masked) tokens.
- .lastTokenPool(normalize:): use the last non-padding token.
- nil: return the raw token-level output without pooling.
(normalize: true L2-normalizes the pooled embedding.)
When not derived from a sentence-transformers config, pooling defaults match each model's convention (.clsTokenPool(normalize: false) for Bert / Roberta / XLMRoberta, .lastTokenPool(normalize: true) for Qwen3, nil for ModernBert / NomicBert).
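The pooling strategies above can be sketched on plain Swift arrays. This is an illustration of the arithmetic only, not how the library computes it on MLTensor. The free functions below are hypothetical and merely mirror the option names; `tokens` stands for the token-level output of one input, and `mask` holds 1 for real tokens and 0 for padding.

```swift
// Illustrative only: plain-array versions of the pooling strategies.
// tokens is [sequenceLength][hiddenSize]; mask has one entry per token.

// Scales a vector to unit L2 norm (what normalize: true refers to).
func l2Normalize(_ v: [Float]) -> [Float] {
    let sumOfSquares = v.reduce(Float(0)) { $0 + $1 * $1 }
    let norm = sumOfSquares.squareRoot()
    return norm > 0 ? v.map { $0 / norm } : v
}

// Uses the first token, which is [CLS] for BERT-style models.
func clsTokenPool(_ tokens: [[Float]]) -> [Float] {
    return tokens[0]
}

// Averages only the tokens whose mask entry is 1.
func meanPool(_ tokens: [[Float]], mask: [Int]) -> [Float] {
    var sum = [Float](repeating: 0, count: tokens[0].count)
    var count: Float = 0
    for (row, keep) in zip(tokens, mask) where keep == 1 {
        for i in row.indices { sum[i] += row[i] }
        count += 1
    }
    return sum.map { $0 / max(count, 1) }
}

// Uses the last token whose mask entry is 1.
func lastTokenPool(_ tokens: [[Float]], mask: [Int]) -> [Float] {
    let lastIndex = mask.lastIndex(of: 1) ?? 0
    return tokens[lastIndex]
}

// Three tokens with hidden size 2; the third token is padding.
let tokens: [[Float]] = [[3, 4], [1, 0], [9, 9]]
let mask = [1, 1, 0]
print(clsTokenPool(tokens))
print(meanPool(tokens, mask: mask))
print(l2Normalize(lastTokenPool(tokens, mask: mask)))
```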
Compute policy (computePolicy)
Defaults to .cpuAndGPU, which also works around an occasional crash in the underlying BNNS matmul implementation (see more here). Use .cpuOnly if you observe high memory usage.
Clip also uses EncodeOptions, but its pooling is fixed (EOS token → projection → L2 normalize), so postProcess is ignored.
The embedding-table bundles (Model2Vec, StaticEmbeddings) take their relevant parameters (normalize, truncateDimension, maxLength, computePolicy) directly on encode / batchEncode.
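For intuition about how truncateDimension and normalize interact, here is a minimal sketch on a plain array. It assumes that truncation keeps the first components of the embedding and that normalization is applied after truncation; both are assumptions for the example, and `truncateAndNormalize` is a hypothetical helper rather than library API.

```swift
// Hypothetical helper, not part of swift-embeddings. Assumes truncation keeps
// the first `dimension` components and that normalization happens afterwards.
func truncateAndNormalize(
    _ embedding: [Float],
    dimension: Int,
    normalize: Bool
) -> [Float] {
    let truncated = Array(embedding.prefix(dimension))
    guard normalize else { return truncated }
    let sumOfSquares = truncated.reduce(Float(0)) { $0 + $1 * $1 }
    let norm = sumOfSquares.squareRoot()
    return norm > 0 ? truncated.map { $0 / norm } : truncated
}

// Keep 2 of 3 dimensions, then rescale the result to unit length.
let shortened = truncateAndNormalize([3, 4, 12], dimension: 2, normalize: true)
print(shortened)
```

Normalizing after truncation matters because a truncated vector is generally no longer unit length, even if the full embedding was.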
Command Line Demo
To run the command line demo, use the following command:
swift run embeddings-cli <subcommand> [--model-id <model-id>] [--model-file <model-file>] [--text <text>] [--max-length <max-length>]

Subcommands:
bert                    Encode text using BERT model
modern-bert             Encode text using ModernBERT model
nomic-bert              Encode text using Nomic embedding model
qwen3                   Encode text using Qwen3 embedding model
clip                    Encode text using CLIP model
model2vec               Encode text using Model2Vec model
roberta                 Encode text using RoBERTa model
static-embeddings       Encode text using Static Embeddings model
xlm-roberta             Encode text using XLMRoberta model
word2vec                Encode word using Word2Vec model

Command line options:
--model-id <model-id>           Id of the model to use
--model-file <model-file>       Path to the model file (only for `Word2Vec`)
--text <text>                   Text to encode
--max-length <max-length>       Maximum length of the input (not for `Word2Vec`)
-h, --help                      Show help information.

Tests
Run the unit tests with:
swift test

The accuracy tests compare the Swift output against a Python/transformers reference. They are skipped unless UV_PATH points at a uv binary (used to run the reference script). Run them with --no-parallel, since running multiple model forward passes concurrently can crash the underlying MLTensor/BNNS matmul:
PYTORCH_ENABLE_MPS_FALLBACK=1 UV_PATH=$(which uv) CI_DISABLE_NETWORK_MONITOR=1 \
    swift test --no-parallel --filter AccuracyTests

- PYTORCH_ENABLE_MPS_FALLBACK=1 lets the PyTorch reference fall back to CPU for ops unsupported on MPS.
- CI_DISABLE_NETWORK_MONITOR=1 disables the network monitor (models are downloaded before the comparison runs).
Benchmarks
The Benchmarks package, built with package-benchmark, measures the encode performance of each model architecture. Run all benchmarks with:
swift package --package-path Benchmarks --disable-sandbox benchmark

--disable-sandbox is required because each benchmark downloads its model in setup (cached after the first run). To list the available benchmarks:
swift package --package-path Benchmarks benchmark list

Code Formatting
This project uses swift-format. To format the code run:
swift format . -i -r --configuration .swift-format

Acknowledgements
This project is based on and uses some of the code from: