soniqo/speech-swift
AI speech models for Apple Silicon, powered by MLX Swift and CoreML.
News
- 19 Apr 2026 — MLX vs CoreML on Apple Silicon — A Practical Guide to Picking the Right Backend
- 20 Mar 2026 — We Beat Whisper Large v3 with a 600M Model Running Entirely on Your Mac
- 26 Feb 2026 — Speaker Diarization and Voice Activity Detection on Apple Silicon — Native Swift with MLX
- 23 Feb 2026 — NVIDIA PersonaPlex 7B on Apple Silicon — Full-Duplex Speech-to-Speech in Native Swift with MLX
- 12 Feb 2026 — Qwen3-ASR Swift: On-Device ASR + TTS for Apple Silicon — Architecture and Benchmarks
Quick start
Add the package to your Package.swift:
.package(url: "https://github.com/soniqo/speech-swift", branch: "main")Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:
.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI", package: "speech-swift"), // optional SwiftUI viewsTranscribe an audio buffer in 3 lines:
```swift
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```
Live streaming with partials:
```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}
```
SwiftUI dictation view in ~10 lines:
```swift
import SwiftUI
import ParakeetStreamingASR
import SpeechUI

@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                // `samples` is your captured audio buffer
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}
```
SpeechUI ships only `TranscriptionView` (finals + partials) and `TranscriptionStore` (the streaming ASR adapter). Use AVFoundation for audio visualization and playback.
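To play model output (for example the 24 kHz mono Float32 samples the TTS models produce), a minimal AVFoundation sketch — the `play` helper is illustrative, not part of the package:
```swift
import AVFoundation

// Illustrative helper (not part of speech-swift): play a mono Float32 buffer.
func play(samples: [Float], sampleRate: Double) throws {
    let engine = AVAudioEngine()
    let player = AVAudioPlayerNode()
    let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: 1)!

    engine.attach(player)
    engine.connect(player, to: engine.mainMixerNode, format: format)

    // Copy the samples into a PCM buffer the player node can schedule.
    let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(samples.count))!
    buffer.frameLength = AVAudioFrameCount(samples.count)
    samples.withUnsafeBufferPointer { src in
        buffer.floatChannelData![0].update(from: src.baseAddress!, count: samples.count)
    }

    try engine.start()
    player.scheduleBuffer(buffer, completionHandler: nil)
    player.play()
}
```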
Available SPM products: Qwen3ASR, Qwen3TTS, Qwen3TTSCoreML, ParakeetASR, ParakeetStreamingASR, NemotronStreamingASR, OmnilingualASR, KokoroTTS, CosyVoiceTTS, PersonaPlex, SpeechVAD, SpeechEnhancement, SourceSeparation, Qwen3Chat, SpeechCore, SpeechUI, AudioCommon.
Models
Compact view below. Full model catalogue with sizes, quantisations, download URLs, and memory tables → soniqo.audio/architecture.
| Model | Task | Backends | Sizes | Languages |
|-------|------|----------|-------|-----------|
| Qwen3-ASR | Speech → Text | MLX, CoreML (hybrid) | 0.6B, 1.7B | 52 |
| Parakeet TDT | Speech → Text | CoreML (ANE) | 0.6B | 25 European |
| Parakeet EOU | Speech → Text (streaming) | CoreML (ANE) | 120M | 25 European |
| Nemotron Streaming | Speech → Text (streaming, punctuated) | CoreML (ANE) | 0.6B | EN |
| Omnilingual ASR | Speech → Text | CoreML (ANE), MLX | 300M / 1B / 3B / 7B | 1,672 |
| Qwen3-ForcedAligner | Audio + Text → Timestamps | MLX, CoreML | 0.6B | Multi |
| Qwen3-TTS | Text → Speech | MLX, CoreML | 0.6B, 1.7B | 10 |
| CosyVoice3 | Text → Speech | MLX | 0.5B | 9 |
| Kokoro-82M | Text → Speech | CoreML (ANE) | 82M | 10 |
| Qwen3.5-Chat | Text → Text (LLM) | MLX, CoreML | 0.8B | Multi |
| PersonaPlex | Speech → Speech | MLX | 7B | EN |
| Silero VAD | Voice Activity Detection | MLX, CoreML | 309K | Agnostic |
| Pyannote | VAD + Diarization | MLX | 1.5M | Agnostic |
| Sortformer | Diarization (E2E) | CoreML (ANE) | — | Agnostic |
| DeepFilterNet3 | Speech Enhancement | CoreML | 2.1M | Agnostic |
| Open-Unmix | Source Separation | MLX | 8.6M | Agnostic |
| WeSpeaker | Speaker Embedding | MLX, CoreML | 6.6M | Agnostic |
Installation
Homebrew
Requires native ARM Homebrew (/opt/homebrew). Rosetta/x86_64 Homebrew is not supported.
```bash
brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech
```
Then:
```bash
audio transcribe recording.wav
audio speak "Hello world"
audio respond --input question.wav --transcript
audio-server --port 8080   # local HTTP / WebSocket server (OpenAI-compatible /v1/realtime)
```
Swift Package Manager
```swift
dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
]
```
Import only what you need — every model is its own SPM target:
```swift
import Qwen3ASR              // Speech recognition (MLX)
import ParakeetASR           // Speech recognition (CoreML, batch)
import ParakeetStreamingASR  // Streaming dictation with partials + EOU
import NemotronStreamingASR  // English streaming ASR with native punctuation (0.6B)
import OmnilingualASR        // 1,672 languages (CoreML + MLX)
import Qwen3TTS              // Text-to-speech
import CosyVoiceTTS          // Text-to-speech with voice cloning
import KokoroTTS             // Text-to-speech (iOS-ready)
import Qwen3Chat             // On-device LLM chat
import PersonaPlex           // Full-duplex speech-to-speech
import SpeechVAD             // VAD + speaker diarization + embeddings
import SpeechEnhancement     // Noise suppression
import SourceSeparation      // Music source separation (Open-Unmix, 4 stems)
import SpeechUI              // SwiftUI components for streaming transcripts
import AudioCommon           // Shared protocols and utilities
```
Requirements
- Swift 6+, Xcode 16+ (with Metal Toolchain)
- macOS 15+ (Sequoia) or iOS 18+, Apple Silicon (M1/M2/M3/M4)
The macOS 15 / iOS 18 minimum comes from MLState — Apple's persistent ANE state API used by the CoreML pipelines (Qwen3-ASR, Qwen3-Chat, Qwen3-TTS) to keep KV caches resident on the Neural Engine across token steps.
Build from source
```bash
git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build
```
`make build` compiles the Swift package and the MLX Metal shader library. The Metal library is required for GPU inference — without it you'll see `Failed to load the default metallib` at runtime. Use `make debug` for debug builds and `make test` for the test suite.
Demo apps
- DictateDemo (docs) — macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
- iOSEchoDemo — iOS echo demo (Parakeet ASR + Kokoro TTS). Device and simulator.
- PersonaPlexDemo — Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real-time).
- SpeechDemo — Dictation and TTS synthesis in a tabbed interface. macOS.
Each demo's README has build instructions.
Code examples
The snippets below show the minimal path for each domain. Every section links to a full guide on soniqo.audio with configuration options, multiple backends, streaming patterns, and CLI recipes.
Speech-to-Text — full guide →
```swift
import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)
```
Alternative backends: Parakeet TDT (CoreML, 32× realtime), Omnilingual ASR (1,672 languages, CoreML or MLX), Streaming dictation (live partials).
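If you prefer the CoreML batch path, Parakeet TDT follows the same shape — a sketch: the `ParakeetASRModel` type name and method signature are assumptions modeled on the streaming API shown in the quick start:
```swift
import ParakeetASR

// Sketch: assumed to mirror ParakeetStreamingASRModel's API.
let parakeet = try await ParakeetASRModel.fromPretrained()
let text = try parakeet.transcribeAudio(audioSamples, sampleRate: 16000)
```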
Forced Alignment — full guide →
```swift
import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}
```
Text-to-Speech — full guide →
```swift
import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```
Alternative TTS engines: CosyVoice3 (streaming + voice cloning + emotion tags), Kokoro-82M (iOS-ready, 54 voices), Voice cloning.
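Kokoro follows the same pattern — a sketch: the `KokoroTTSModel` type name, the `voice:` label, and the voice ID are assumptions (the catalogue lists 54 voices):
```swift
import KokoroTTS
import AudioCommon

// Sketch: type name, `voice:` label, and voice ID are assumptions.
let kokoro = try await KokoroTTSModel.fromPretrained()
let audio = kokoro.synthesize(text: "Hello world", voice: "af_heart")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```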
Speech-to-Speech — full guide →
```swift
import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback
```
LLM Chat — full guide →
```swift
import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}
```
Voice Activity Detection — full guide →
```swift
import SpeechVAD

let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s → \(s.endTime)s") }
```
Speaker Diarization — full guide →
```swift
import SpeechVAD

let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }
```
Speech Enhancement — full guide →
```swift
import SpeechEnhancement

let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)
```
Voice Pipeline (ASR → LLM → TTS) — full guide →
```swift
import SpeechCore

let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)
```
VoicePipeline is the real-time voice-agent state machine (powered by speech-core) with VAD-driven turn detection, interruption handling, and eager STT. It connects any SpeechRecognitionModel + SpeechGenerationModel + StreamingVADProvider.
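Feeding the pipeline from the microphone — a sketch continuing the snippet above; `pushAudio` is shown there, but the expected sample rate (and whether you must resample the mic input) is an assumption to verify against the full guide:
```swift
import AVFoundation

// Sketch: tap the default input and forward Float32 samples to the pipeline.
// Resampling to the models' expected rate (e.g. 16 kHz) may be required.
let engine = AVAudioEngine()
let input = engine.inputNode
let format = input.outputFormat(forBus: 0)

input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
    guard let channel = buffer.floatChannelData?[0] else { return }
    let micSamples = Array(UnsafeBufferPointer(start: channel, count: Int(buffer.frameLength)))
    pipeline.pushAudio(micSamples)
}
try engine.start()
```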
HTTP API server
```bash
audio-server --port 8080
```
Exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at `/v1/realtime`. See `Sources/AudioServer/`.
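From Swift you can talk to the realtime endpoint with a plain URLSessionWebSocketTask — a sketch; only the `/v1/realtime` path is documented above, and the JSON event schema follows the OpenAI Realtime API:
```swift
import Foundation

// Sketch: connect to the local realtime WebSocket started by audio-server.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let socket = URLSession.shared.webSocketTask(with: url)
socket.resume()

socket.receive { result in
    if case .success(.string(let event)) = result {
        print("event: \(event)")  // OpenAI Realtime-style JSON events
    }
}
```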
Architecture
speech-swift is split into one SPM target per model so consumers only pay for what they import. Shared infrastructure lives in AudioCommon (protocols, audio I/O, HuggingFace downloader, SentencePieceModel) and MLXCommon (weight loading, QuantizedLinear helpers, SDPA multi-head attention helper).
Full architecture diagram with backends, memory tables, and module map → soniqo.audio/architecture · API reference → soniqo.audio/api · Benchmarks → soniqo.audio/benchmarks
Local docs (repo):
- Models: Qwen3-ASR · Qwen3-TTS · CosyVoice · Kokoro · Parakeet TDT · Parakeet Streaming · Nemotron Streaming · Omnilingual ASR · PersonaPlex · FireRedVAD · Source Separation
- Inference: Qwen3-ASR · Parakeet TDT · Parakeet Streaming · Nemotron Streaming · Omnilingual ASR · TTS · Forced Aligner · Silero VAD · Speaker Diarization · Speech Enhancement
- Reference: Shared Protocols
Cache configuration
Model weights download from HuggingFace on first use and are cached to `~/Library/Caches/qwen3-speech/`. Override the location with `QWEN3_CACHE_DIR` (CLI) or `cacheDir:` (Swift API). All `fromPretrained()` entry points also accept `offlineMode: true` to skip the network when weights are already cached.
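In Swift that looks roughly like this — a sketch: the `cacheDir:` and `offlineMode:` labels come from the note above, but the exact parameter types are an assumption:
```swift
import Qwen3ASR

// Sketch: custom cache location + offline mode (labels per the note above;
// parameter types are an assumption). The path is illustrative.
let model = try await Qwen3ASRModel.fromPretrained(
    cacheDir: URL(fileURLWithPath: "/Volumes/Models/speech-cache"),
    offlineMode: true  // fail fast instead of hitting the network
)
```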
See docs/inference/cache-and-offline.md for full details including sandboxed iOS container paths.
MLX Metal library
If you see `Failed to load the default metallib` at runtime, the Metal shader library is missing. Run `make build`, or `./scripts/build_mlx_metallib.sh release` after a manual `swift build`. If the Metal Toolchain is missing, install it first:
```bash
xcodebuild -downloadComponent MetalToolchain
```
Testing
```bash
make test                           # full suite (unit + E2E with model downloads)
swift test --skip E2E               # unit only (CI-safe, no downloads)
swift test --filter Qwen3ASRTests   # specific module
```
E2E test classes use the `E2E` prefix so CI can filter them out with `--skip E2E`. See CLAUDE.md for the full testing convention.
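The convention in code — a minimal sketch; the class and test names are illustrative:
```swift
import XCTest

// The `E2E` prefix marks tests that download model weights;
// CI excludes them via `swift test --skip E2E`.
final class E2EQwen3ASRTests: XCTestCase {
    func testTranscribesFixtureRecording() async throws {
        // Downloads weights on first run, then transcribes a bundled fixture.
    }
}
```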
Contributing
PRs welcome — bug fixes, new model integrations, documentation. Fork, create a feature branch, run `make build && make test`, and open a PR against `main`.
License
Apache 2.0