soniqo/speech-swift
AI speech models for Apple Silicon, powered by MLX Swift and CoreML.
News
- 19 Apr 2026 — MLX vs CoreML on Apple Silicon — A Practical Guide to Picking the Right Backend
- 20 Mar 2026 — We Beat Whisper Large v3 with a 600M Model Running Entirely on Your Mac
- 26 Feb 2026 — Speaker Diarization and Voice Activity Detection on Apple Silicon — Native Swift with MLX
- 23 Feb 2026 — NVIDIA PersonaPlex 7B on Apple Silicon — Full-Duplex Speech-to-Speech in Native Swift with MLX
- 12 Feb 2026 — Qwen3-ASR Swift: On-Device ASR + TTS for Apple Silicon — Architecture and Benchmarks
Quick start
Add the package to your Package.swift:
.package(url: "https://github.com/soniqo/speech-swift", branch: "main")Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:
.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI", package: "speech-swift"), // optional SwiftUI viewsTranscribe an audio buffer in 3 lines:
```swift
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```
Live streaming with partials:
```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}
```
SwiftUI dictation view in ~10 lines:
```swift
import SwiftUI
import ParakeetStreamingASR
import SpeechUI

@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                // `samples` is your captured audio buffer
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}
```
SpeechUI ships only `TranscriptionView` (finals + partials) and `TranscriptionStore` (the streaming ASR adapter). Use AVFoundation for audio visualization and playback.
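To play model output (for example the 24 kHz mono Float32 samples the TTS models produce), a minimal AVFoundation sketch — the `play` helper is illustrative, not part of the package:
```swift
import AVFoundation

// Illustrative helper (not part of speech-swift): play a mono Float32 buffer.
func play(samples: [Float], sampleRate: Double) throws {
    let engine = AVAudioEngine()
    let player = AVAudioPlayerNode()
    let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: 1)!

    engine.attach(player)
    engine.connect(player, to: engine.mainMixerNode, format: format)

    // Copy the samples into a PCM buffer the player node can schedule.
    let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(samples.count))!
    buffer.frameLength = AVAudioFrameCount(samples.count)
    samples.withUnsafeBufferPointer { src in
        buffer.floatChannelData![0].update(from: src.baseAddress!, count: samples.count)
    }

    try engine.start()
    player.scheduleBuffer(buffer, completionHandler: nil)
    player.play()
}
```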
Available SPM products: Qwen3ASR, Qwen3TTS, Qwen3TTSCoreML, ParakeetASR, ParakeetStreamingASR, NemotronStreamingASR, OmnilingualASR, KokoroTTS, CosyVoiceTTS, PersonaPlex, SpeechVAD, SpeechEnhancement, SourceSeparation, Qwen3Chat, SpeechCore, SpeechUI, AudioCommon.
Models
Compact view below. Full model catalogue with sizes, quantisations, download URLs, and memory tables → soniqo.audio/architecture.
| Model | Task | Backends | Sizes | Languages |
|-------|------|----------|-------|-----------|
| Qwen3-ASR | Speech → Text | MLX, CoreML (hybrid) | 0.6B, 1.7B | 52 |
| Parakeet TDT | Speech → Text | CoreML (ANE) | 0.6B | 25 European |
| Parakeet EOU | Speech → Text (streaming) | CoreML (ANE) | 120M | 25 European |
| Nemotron Streaming | Speech → Text (streaming, punctuated) | CoreML (ANE) | 0.6B | EN |
| Omnilingual ASR | Speech → Text | CoreML (ANE), MLX | 300M / 1B / 3B / 7B | 1,672 |
| Qwen3-ForcedAligner | Audio + Text → Timestamps | MLX, CoreML | 0.6B | Multi |
| Qwen3-TTS | Text → Speech | MLX, CoreML | 0.6B, 1.7B | 10 |
| CosyVoice3 | Text → Speech | MLX | 0.5B | 9 |
| Kokoro-82M | Text → Speech | CoreML (ANE) | 82M | 10 |
| Qwen3.5-Chat | Text → Text (LLM) | MLX, CoreML | 0.8B | Multi |
| PersonaPlex | Speech → Speech | MLX | 7B | EN |
| Silero VAD | Voice Activity Detection | MLX, CoreML | 309K | Agnostic |
| Pyannote | VAD + Diarization | MLX | 1.5M | Agnostic |
| Sortformer | Diarization (E2E) | CoreML (ANE) | — | Agnostic |
| DeepFilterNet3 | Speech Enhancement | CoreML | 2.1M | Agnostic |
| Open-Unmix | Source Separation | MLX | 8.6M | Agnostic |
| WeSpeaker | Speaker Embedding | MLX, CoreML | 6.6M | Agnostic |
Installation
Homebrew
Requires native ARM Homebrew (/opt/homebrew). Rosetta/x86_64 Homebrew is not supported.
```bash
brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech
```
Then:
```bash
audio transcribe recording.wav
audio speak "Hello world"
audio respond --input question.wav --transcript
audio-server --port 8080   # local HTTP / WebSocket server (OpenAI-compatible /v1/realtime)
```
Swift Package Manager
```swift
dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
]
```
Import only what you need — every model is its own SPM target:
```swift
import Qwen3ASR              // Speech recognition (MLX)
import ParakeetASR           // Speech recognition (CoreML, batch)
import ParakeetStreamingASR  // Streaming dictation with partials + EOU
import NemotronStreamingASR  // English streaming ASR with native punctuation (0.6B)
import OmnilingualASR        // 1,672 languages (CoreML + MLX)
import Qwen3TTS              // Text-to-speech
import CosyVoiceTTS          // Text-to-speech with voice cloning
import KokoroTTS             // Text-to-speech (iOS-ready)
import Qwen3Chat             // On-device LLM chat
import PersonaPlex           // Full-duplex speech-to-speech
import SpeechVAD             // VAD + speaker diarization + embeddings
import SpeechEnhancement     // Noise suppression
import SourceSeparation      // Music source separation (Open-Unmix, 4 stems)
import SpeechUI              // SwiftUI components for streaming transcripts
import AudioCommon           // Shared protocols and utilities
```
Requirements
- Swift 6+, Xcode 16+ (with Metal Toolchain)
- macOS 15+ (Sequoia) or iOS 18+, Apple Silicon (M1/M2/M3/M4)
The macOS 15 / iOS 18 minimum comes from MLState — Apple's persistent ANE state API used by the CoreML pipelines (Qwen3-ASR, Qwen3-Chat, Qwen3-TTS) to keep KV caches resident on the Neural Engine across token steps.
Build from source
```bash
git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build
```
`make build` compiles the Swift package and the MLX Metal shader library. The Metal library is required for GPU inference — without it you'll see `Failed to load the default metallib` at runtime. Use `make debug` for debug builds and `make test` for the test suite.
Demo apps
- DictateDemo (docs) — macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
- iOSEchoDemo — iOS echo demo (Parakeet ASR + Kokoro TTS). Device and simulator.
- PersonaPlexDemo — Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real-time).
- SpeechDemo — Dictation and TTS synthesis in a tabbed interface. macOS.
Each demo's README has build instructions.
Code examples
The snippets below show the minimal path for each domain. Every section links to a full guide on soniqo.audio with configuration options, multiple backends, streaming patterns, and CLI recipes.
Speech-to-Text — full guide →
```swift
import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)
```
Alternative backends: Parakeet TDT (CoreML, 32× realtime), Omnilingual ASR (1,672 languages, CoreML or MLX), Streaming dictation (live partials).
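If you prefer the CoreML batch path, Parakeet TDT follows the same shape — a sketch: the `ParakeetASRModel` type name and method signature are assumptions modeled on the streaming API shown in the quick start:
```swift
import ParakeetASR

// Sketch: assumed to mirror ParakeetStreamingASRModel's API.
let parakeet = try await ParakeetASRModel.fromPretrained()
let text = try parakeet.transcribeAudio(audioSamples, sampleRate: 16000)
```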
Forced Alignment — full guide →
```swift
import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}
```
Text-to-Speech — full guide →
```swift
import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```
Alternative TTS engines: CosyVoice3 (streaming + voice cloning + emotion tags), Kokoro-82M (iOS-ready, 54 voices), Voice cloning.
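Kokoro follows the same pattern — a sketch: the `KokoroTTSModel` type name, the `voice:` label, and the voice ID are assumptions (the catalogue lists 54 voices):
```swift
import KokoroTTS
import AudioCommon

// Sketch: type name, `voice:` label, and voice ID are assumptions.
let kokoro = try await KokoroTTSModel.fromPretrained()
let audio = kokoro.synthesize(text: "Hello world", voice: "af_heart")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```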
Speech-to-Speech — full guide →
```swift
import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback
```
LLM Chat — full guide →
```swift
import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}
```
Voice Activity Detection — full guide →
```swift
import SpeechVAD

let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s → \(s.endTime)s") }
```
Speaker Diarization — full guide →
```swift
import SpeechVAD

let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }
```
Speech Enhancement — full guide →
```swift
import SpeechEnhancement

let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)
```
Voice Pipeline (ASR → LLM → TTS) — full guide →
```swift
import SpeechCore

let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)
```
VoicePipeline is the real-time voice-agent state machine (powered by speech-core) with VAD-driven turn detection, interruption handling, and eager STT. It connects any SpeechRecognitionModel + SpeechGenerationModel + StreamingVADProvider.
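Feeding the pipeline from the microphone — a sketch continuing the snippet above; `pushAudio` is shown there, but the expected sample rate (and whether you must resample the mic input) is an assumption to verify against the full guide:
```swift
import AVFoundation

// Sketch: tap the default input and forward Float32 samples to the pipeline.
// Resampling to the models' expected rate (e.g. 16 kHz) may be required.
let engine = AVAudioEngine()
let input = engine.inputNode
let format = input.outputFormat(forBus: 0)

input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
    guard let channel = buffer.floatChannelData?[0] else { return }
    let micSamples = Array(UnsafeBufferPointer(start: channel, count: Int(buffer.frameLength)))
    pipeline.pushAudio(micSamples)
}
try engine.start()
```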
HTTP API server
```bash
audio-server --port 8080
```
Exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at `/v1/realtime`. See `Sources/AudioServer/`.
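From Swift you can talk to the realtime endpoint with a plain URLSessionWebSocketTask — a sketch; only the `/v1/realtime` path is documented above, and the JSON event schema follows the OpenAI Realtime API:
```swift
import Foundation

// Sketch: connect to the local realtime WebSocket started by audio-server.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let socket = URLSession.shared.webSocketTask(with: url)
socket.resume()

socket.receive { result in
    if case .success(.string(let event)) = result {
        print("event: \(event)")  // OpenAI Realtime-style JSON events
    }
}
```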
Architecture
speech-swift is split into one SPM target per model so consumers only pay for what they import. Shared infrastructure lives in AudioCommon (protocols, audio I/O, HuggingFace downloader, SentencePieceModel) and MLXCommon (weight loading, QuantizedLinear helpers, SDPA multi-head attention helper).
Full architecture diagram with backends, memory tables, and module map → soniqo.audio/architecture · API reference → soniqo.audio/api · Benchmarks → soniqo.audio/benchmarks
Local docs (repo):
- Models: Qwen3-ASR · Qwen3-TTS · CosyVoice · Kokoro · Parakeet TDT · Parakeet Streaming · Nemotron Streaming · Omnilingual ASR · PersonaPlex · FireRedVAD · Source Separation
- Inference: Qwen3-ASR · Parakeet TDT · Parakeet Streaming · Nemotron Streaming · Omnilingual ASR · TTS · Forced Aligner · Silero VAD · Speaker Diarization · Speech Enhancement
- Reference: Shared Protocols
Cache configuration
Model weights download from HuggingFace on first use and are cached to `~/Library/Caches/qwen3-speech/`. Override the location with `QWEN3_CACHE_DIR` (CLI) or `cacheDir:` (Swift API). All `fromPretrained()` entry points also accept `offlineMode: true` to skip the network when weights are already cached.
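In Swift that looks roughly like this — a sketch: the `cacheDir:` and `offlineMode:` labels come from the note above, but the exact parameter types are an assumption:
```swift
import Qwen3ASR

// Sketch: custom cache location + offline mode (labels per the note above;
// parameter types are an assumption). The path is illustrative.
let model = try await Qwen3ASRModel.fromPretrained(
    cacheDir: URL(fileURLWithPath: "/Volumes/Models/speech-cache"),
    offlineMode: true  // fail fast instead of hitting the network
)
```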
See docs/inference/cache-and-offline.md for full details including sandboxed iOS container paths.
MLX Metal library
If you see `Failed to load the default metallib` at runtime, the Metal shader library is missing. Run `make build`, or `./scripts/build_mlx_metallib.sh release` after a manual `swift build`. If the Metal Toolchain is missing, install it first:
```bash
xcodebuild -downloadComponent MetalToolchain
```
Testing
```bash
make test                           # full suite (unit + E2E with model downloads)
swift test --skip E2E               # unit only (CI-safe, no downloads)
swift test --filter Qwen3ASRTests   # specific module
```
E2E test classes use the `E2E` prefix so CI can filter them out with `--skip E2E`. See CLAUDE.md for the full testing convention.
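The convention in code — a minimal sketch; the class and test names are illustrative:
```swift
import XCTest

// The `E2E` prefix marks tests that download model weights;
// CI excludes them via `swift test --skip E2E`.
final class E2EQwen3ASRTests: XCTestCase {
    func testTranscribesFixtureRecording() async throws {
        // Downloads weights on first run, then transcribes a bundled fixture.
    }
}
```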
Contributing
PRs welcome — bug fixes, new model integrations, documentation. Fork, create a feature branch, run `make build && make test`, and open a PR against `main`.
License
Apache 2.0