# Blaizzy/mlx-audio
A text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech generation and analysis on Apple Silicon.
## Features
- Fast inference optimized for Apple Silicon (M series chips)
- Multiple model architectures for TTS, STT, and STS
- Multilingual support across models
- Voice customization and cloning capabilities
- Adjustable speech speed control
- Interactive web interface with 3D audio visualization
- OpenAI-compatible REST API
- Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
- Swift package for iOS/macOS integration
## Installation
### Using pip
```bash
pip install mlx-audio
```
### Using uv to install only the command-line tools
Latest release from PyPI:
```bash
uv tool install --force mlx-audio --prerelease=allow
```
Latest code from GitHub:
```bash
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
```
### For development or the web interface
```bash
git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev]"
```
## Quick Start
### Command Line
```bash
# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello, world!' --voice Chelsie

# With a different voice and language hint
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Welcome to MLX-Audio!' --voice Ethan --lang_code English

# Play audio immediately
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --play

# Save to a specific directory
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --output_path ./my_audio

# Stream audio during generation
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --stream

# Stream audio during generation and save it to disk
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --stream --save

# Join multiple generated segments into one file
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text $'Hello!\nHow are you?' --voice Chelsie --join_audio
```
By default, when generation yields multiple segments, mlx-audio saves numbered files such as `audio_000.wav` and `audio_001.wav`. Use `--join_audio` to save one combined file instead. When using `--stream`, add `--save` to write the streamed audio to disk.
### Python API
```python
from mlx_audio.tts.utils import load_model

# Load model
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit")

# Generate speech
for result in model.generate(
    "Hello from MLX-Audio!",
    voice="Chelsie",
    lang_code="English",
):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array
```
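To write the generated waveform to disk, here is a minimal sketch using the third-party `soundfile` package; the 24 kHz sample rate is an assumption, so check your model's documented output rate:
```python
import numpy as np
import soundfile as sf

# Convert the mx.array waveform to NumPy and write a WAV file.
# NOTE: 24_000 Hz is an assumed sample rate; use your model's actual rate.
sf.write("hello.wav", np.array(result.audio), 24_000)
```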
## Supported Models
### Text-to-Speech (TTS)
| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Kokoro** | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | [bf16](https://huggingface.co/mlx-community/Kokoro-82M-bf16), [8bit](https://huggingface.co/mlx-community/Kokoro-82M-8bit), [6bit](https://huggingface.co/mlx-community/Kokoro-82M-6bit), [4bit](https://huggingface.co/mlx-community/Kokoro-82M-4bit) |
| **Qwen3-TTS** | Alibaba's multilingual TTS with voice design | ZH, EN, JA, KO, + more | [mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16](https://huggingface.co/mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16) |
| **CSM** | Conversational Speech Model with voice cloning | EN | [mlx-community/csm-1b](https://huggingface.co/mlx-community/csm-1b) |
| **Dia** | Dialogue-focused TTS | EN | [mlx-community/Dia-1.6B-fp16](https://huggingface.co/mlx-community/Dia-1.6B-fp16) |
| **OuteTTS** | Efficient TTS model | EN | [mlx-community/OuteTTS-1.0-0.6B-fp16](https://huggingface.co/mlx-community/OuteTTS-1.0-0.6B-fp16) |
| **Spark** | SparkTTS model | EN, ZH | [mlx-community/Spark-TTS-0.5B-bf16](https://huggingface.co/mlx-community/Spark-TTS-0.5B-bf16) |
| **Chatterbox** | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | [mlx-community/chatterbox-fp16](https://huggingface.co/mlx-community/chatterbox-fp16) |
| **Soprano** | High-quality TTS | EN | [mlx-community/Soprano-1.1-80M-bf16](https://huggingface.co/mlx-community/Soprano-1.1-80M-bf16) |
| **Ming Omni TTS (BailingMM)** | Multimodal generation with voice cloning, style control, and speech/music/event generation | EN, ZH | [mlx-community/Ming-omni-tts-16.8B-A3B-bf16](https://huggingface.co/mlx-community/Ming-omni-tts-16.8B-A3B-bf16) |
| **Ming Omni TTS (Dense)** | Lightweight dense Ming Omni variant for voice cloning and style control | EN, ZH | [mlx-community/Ming-omni-tts-0.5B-bf16](https://huggingface.co/mlx-community/Ming-omni-tts-0.5B-bf16) |
| **KugelAudio** | SOTA 7B AR+Diffusion TTS for European languages | EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, + 14 more | [kugelaudio/kugelaudio-0-open](https://huggingface.co/kugelaudio/kugelaudio-0-open) |
| **Voxtral TTS** | Mistral's 4B multilingual TTS (20 voices, 9 languages) | EN, FR, ES, DE, IT, PT, NL, AR, HI | [mlx-community/Voxtral-4B-TTS-2603-mlx-bf16](https://huggingface.co/mlx-community/Voxtral-4B-TTS-2603-mlx-bf16) |
| **LongCat-AudioDiT** | SOTA diffusion TTS in waveform latent space with voice cloning | ZH, EN | [mlx-community/LongCat-AudioDiT-1B-bf16](https://huggingface.co/mlx-community/LongCat-AudioDiT-1B-bf16) |
| **MeloTTS** | Lightweight VITS2-based TTS with streaming | EN (more coming) | [mlx-community/MeloTTS-English-MLX](https://huggingface.co/mlx-community/MeloTTS-English-MLX) |
| **Higgs Audio v2** | 3B Llama-backed TTS with real-time voice cloning | EN, ZH, KO, DE, ES | [bf16 (upstream)](https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base), [q8](https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q8), [q6](https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q6) |
### Speech-to-Text (STT)
| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Whisper** | OpenAI's robust STT model | 99+ languages | [mlx-community/whisper-large-v3-turbo-asr-fp16](https://huggingface.co/mlx-community/whisper-large-v3-turbo-asr-fp16) |
| **Distil-Whisper** | Distilled fast Whisper variants | EN | [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3) |
| **Qwen3-ASR** | Alibaba's multilingual ASR | ZH, EN, JA, KO, + more | [mlx-community/Qwen3-ASR-1.7B-8bit](https://huggingface.co/mlx-community/Qwen3-ASR-1.7B-8bit) |
| **Qwen3-ForcedAligner** | Word-level audio alignment | ZH, EN, JA, KO, + more | [mlx-community/Qwen3-ForcedAligner-0.6B-8bit](https://huggingface.co/mlx-community/Qwen3-ForcedAligner-0.6B-8bit) |
| **Parakeet** | NVIDIA's accurate STT | EN (v2), 25 EU languages (v3) | [mlx-community/parakeet-tdt-0.6b-v3](https://huggingface.co/mlx-community/parakeet-tdt-0.6b-v3) |
| **Voxtral** | Mistral's speech model | Multiple | [mlx-community/Voxtral-Mini-3B-2507-bf16](https://huggingface.co/mlx-community/Voxtral-Mini-3B-2507-bf16) |
| **Voxtral Realtime** | Mistral's 4B streaming STT | Multiple | [4bit](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit), [fp16](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16) |
| **VibeVoice-ASR** | Microsoft's 9B ASR with diarization & timestamps | Multiple | [mlx-community/VibeVoice-ASR-bf16](https://huggingface.co/mlx-community/VibeVoice-ASR-bf16) |
| **Canary** | NVIDIA's multilingual ASR with translation | 25 EU + RU, UK | [README](mlx_audio/stt/models/canary/README.md) |
| **Moonshine** | Useful Sensors' lightweight ASR | EN | [README](mlx_audio/stt/models/moonshine/README.md) |
| **MMS** | Meta's massively multilingual ASR with adapters | 1000+ | [README](mlx_audio/stt/models/mms/README.md) |
| **Granite Speech** | IBM's ASR + speech translation | EN, FR, DE, ES, PT, JA | [README](mlx_audio/stt/models/granite_speech/README.md) |
| **Qwen2-Audio** | Alibaba's multimodal audio understanding (ASR, captioning, emotion, translation) | Multiple | [mlx-community/Qwen2-Audio-7B-Instruct-4bit](https://huggingface.co/mlx-community/Qwen2-Audio-7B-Instruct-4bit) |
### Voice Activity Detection / Speaker Diarization (VAD)
| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Sortformer v1** | NVIDIA's end-to-end speaker diarization (up to 4 speakers) | Language-agnostic | [mlx-community/diar_sortformer_4spk-v1-fp32](https://huggingface.co/mlx-community/diar_sortformer_4spk-v1-fp32) |
| **Sortformer v2.1** | NVIDIA's streaming speaker diarization with AOSC compression | Language-agnostic | [mlx-community/diar_streaming_sortformer_4spk-v2.1-fp32](https://huggingface.co/mlx-community/diar_streaming_sortformer_4spk-v2.1-fp32) |
See the [Sortformer README](mlx_audio/vad/models/sortformer/README.md) for API details, streaming examples, and model conversion.
### Speech-to-Speech (STS)
| Model | Description | Use Case | Repo |
|-------|-------------|----------|------|
| **SAM-Audio** | Text-guided source separation | Extract specific sounds | [mlx-community/sam-audio-large](https://huggingface.co/mlx-community/sam-audio-large) |
| **Liquid2.5-Audio** | Speech-to-Speech, Text-to-Speech, and Speech-to-Text | Speech interactions | [mlx-community/LFM2.5-Audio-1.5B-8bit](https://huggingface.co/mlx-community/LFM2.5-Audio-1.5B-8bit) |
| **MossFormer2 SE** | Speech enhancement | Noise removal | [starkdmi/MossFormer2_SE_48K_MLX](https://huggingface.co/starkdmi/MossFormer2_SE_48K_MLX) |
| **DeepFilterNet (1/2/3)** | Speech enhancement | Noise suppression | [mlx-community/DeepFilterNet-mlx](https://huggingface.co/mlx-community/DeepFilterNet-mlx) |
## Model Examples
### Qwen3-TTS
Alibaba's state-of-the-art multilingual TTS with voice cloning, emotion control, and voice design capabilities.
```python
from mlx_audio.tts.utils import load_model
model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    language="English",
))
audio = results[0].audio # mx.array
```
See the [Qwen3-TTS README](mlx_audio/tts/models/qwen3_tts/README.md) for voice cloning, CustomVoice, VoiceDesign, and all available models.
### Ming Omni TTS (BailingMM)
```bash
mlx_audio.tts.generate \
--model mlx-community/Ming-omni-tts-16.8B-A3B-bf16 \
--prompt "Please generate speech based on the following description.\n" \
--text "This is a quick Ming Omni test." \
--lang_code en \
--output_path audio_io \
--file_prefix ming_basic \
--verbose
```
See the [Ming Omni TTS README](mlx_audio/tts/models/bailingmm/README.md) for CLI and Python cookbook examples, and the [Ming Omni Dense README](mlx_audio/tts/models/dense/README.md) for the `mlx-community/Ming-omni-tts-0.5B-bf16` workflow.
### Kokoro TTS
Kokoro is a fast, multilingual TTS model with 54 voice presets.
```python
from mlx_audio.tts.utils import load_model
model = load_model("mlx-community/Kokoro-82M-bf16")
# Or use a quantized variant for lower memory usage:
# model = load_model("mlx-community/Kokoro-82M-8bit")
# model = load_model("mlx-community/Kokoro-82M-4bit")
# Generate with different voices
for result in model.generate(
    text="Welcome to MLX-Audio!",
    voice="af_heart",  # American female
    speed=1.0,
    lang_code="a",  # American English
):
    audio = result.audio
```
**Available Voices:**
- American English: `af_heart`, `af_bella`, `af_nova`, `af_sky`, `am_adam`, `am_echo`, etc.
- British English: `bf_alice`, `bf_emma`, `bm_daniel`, `bm_george`, etc.
- Japanese: `jf_alpha`, `jm_kumo`, etc.
- Chinese: `zf_xiaobei`, `zm_yunxi`, etc.
Kokoro requires `pip install misaki` for text processing. Japanese and Mandarin may additionally require `pip install misaki[ja]` or `pip install misaki[zh]`.
**Language Codes:**
| Code | Language | Note |
|------|----------|------|
| `a` | American English | Default; requires `pip install misaki` |
| `b` | British English | Requires `pip install misaki` |
| `j` | Japanese | Requires `pip install misaki[ja]` |
| `z` | Mandarin Chinese | Requires `pip install misaki[zh]` |
| `e` | Spanish | Requires `pip install misaki` |
| `f` | French | Requires `pip install misaki` |
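Continuing the example above, a voice can be paired with its matching language code from the two tables; for instance, British English with a British voice preset:
```python
# British English: pair the "b" language code with a British voice preset.
for result in model.generate(
    text="Good evening from London.",
    voice="bf_emma",
    lang_code="b",
):
    audio = result.audio
```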
### CSM (Voice Cloning)
Clone any voice using a reference audio sample:
```bash
mlx_audio.tts.generate \
--model mlx-community/csm-1b \
--text "Hello from Sesame." \
--ref_audio ./reference_voice.wav \
--play
```
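The same cloning flow should be available from Python; a minimal sketch, assuming `model.generate` accepts a `ref_audio` keyword mirroring the CLI flag above:
```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/csm-1b")

# ref_audio points at a short clip of the voice to clone
# (assumed keyword, mirroring the CLI's --ref_audio flag).
for result in model.generate(
    text="Hello from Sesame.",
    ref_audio="./reference_voice.wav",
):
    audio = result.audio  # cloned-voice waveform as mx.array
```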
### Whisper STT
```python
from mlx_audio.stt.generate import generate_transcription
result = generate_transcription(
    model="mlx-community/whisper-large-v3-turbo-asr-fp16",
    audio="audio.wav",
)
print(result.text)
```
### Qwen3-ASR & ForcedAligner
Alibaba's multilingual speech models for transcription and word-level alignment.
```python
from mlx_audio.stt import load
# Speech recognition
model = load("mlx-community/Qwen3-ASR-0.6B-8bit")
result = model.generate("audio.wav", language="English")
print(result.text)
# Word-level forced alignment
aligner = load("mlx-community/Qwen3-ForcedAligner-0.6B-8bit")
result = aligner.generate("audio.wav", text="I have a dream", language="English")
for item in result:
    print(f"[{item.start_time:.2f}s - {item.end_time:.2f}s] {item.text}")
```
See the [Qwen3-ASR README](mlx_audio/stt/models/qwen3_asr/README.md) for CLI usage, all models, and more examples.
### VibeVoice-ASR
Microsoft's 9B parameter speech-to-text model with speaker diarization and timestamps. Supports long-form audio (up to 60 minutes) and outputs structured JSON.
```python
from mlx_audio.stt.utils import load
model = load("mlx-community/VibeVoice-ASR-bf16")
# Basic transcription
result = model.generate(audio="meeting.wav", max_tokens=8192, temperature=0.0)
print(result.text)
# [{"Start":0,"End":5.2,"Speaker":0,"Content":"Hello everyone, let's begin."},
# {"Start":5.5,"End":9.8,"Speaker":1,"Content":"Thanks for joining today."}]
# Access parsed segments
for seg in result.segments:
    print(f"[{seg['start_time']:.1f}-{seg['end_time']:.1f}] Speaker {seg['speaker_id']}: {seg['text']}")
```
**Streaming transcription:**
```python
# Stream tokens as they are generated
for text in model.stream_transcribe(audio="speech.wav", max_tokens=4096):
    print(text, end="", flush=True)
```
**With context (hotwords/metadata):**
```python
result = model.generate(
    audio="technical_talk.wav",
    context="MLX, Apple Silicon, PyTorch, Transformer",
    max_tokens=8192,
    temperature=0.0,
)
```
**CLI usage:**
```bash
# Basic transcription
python -m mlx_audio.stt.generate \
--model mlx-community/VibeVoice-ASR-bf16 \
--audio meeting.wav \
--output-path output \
--format json \
--max-tokens 8192 \
--verbose
# With context/hotwords
python -m mlx_audio.stt.generate \
--model mlx-community/VibeVoice-ASR-bf16 \
--audio technical_talk.wav \
--output-path output \
--format json \
--max-tokens 8192 \
--context "MLX, Apple Silicon, PyTorch, Transformer" \
--verbose
```
### Parakeet (Multilingual STT)
NVIDIA's high-accuracy speech-to-text model. Parakeet v3 supports 25 European languages.
```python
from mlx_audio.stt.utils import load
# Load the multilingual v3 model
model = load("mlx-community/parakeet-tdt-0.6b-v3")
# Transcribe audio
result = model.generate("audio.wav")
print(f"Text: {result.text}")
# Access word-level timestamps
for sentence in result.sentences:
    print(f"[{sentence.start:.2f}s - {sentence.end:.2f}s] {sentence.text}")
```
**Streaming transcription:**
```python
for chunk in model.generate("long_audio.wav", stream=True):
    print(chunk.text, end="", flush=True)
```
**Supported languages (v3):**
Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, Ukrainian
**CLI usage:**
```bash
python -m mlx_audio.stt.generate \
--model mlx-community/parakeet-tdt-0.6b-v3 \
--audio speech.wav \
--output-path output \
--format json \
--verbose
```
### KugelAudio
SOTA open-source 7B TTS model for 24 European languages, based on Microsoft VibeVoice.
Uses a hybrid AR + Diffusion architecture (Qwen2.5 LM + SDE-DPM-Solver++ diffusion head + VAE decoder).
```python
from mlx_audio.tts.utils import load_model
model = load_model("kugelaudio/kugelaudio-0-open")
for result in model.generate(
    text="Hello, welcome to MLX-Audio!",
    cfg_scale=3.0,  # Classifier-free guidance (1.0=fast, 3.0=quality)
    ddpm_steps=10,  # Diffusion steps (5=fast, 10=balanced, 20=max quality)
):
    audio = result.audio  # mx.array, 24kHz
```
The model loads directly from HuggingFace (weights are remapped automatically via `sanitize()`).
To quantize or save in a pre-converted format:
```bash
python -m mlx_audio.convert \
--hf-path kugelaudio/kugelaudio-0-open \
--mlx-path ./kugelaudio-0-open-bf16 \
--dtype bfloat16
```
**Supported languages (24):** English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Romanian, Hungarian, Swedish, Danish, Finnish, Norwegian, Greek, Bulgarian, Slovak, Croatian, Serbian, Turkish
> **Note:** Requires ~17GB memory (7B params in bfloat16).
> Pre-encoded voice presets (voice cloning) are not yet available in the upstream model — the model generates speech with a default voice.
### LongCat-AudioDiT
SOTA diffusion-based TTS operating in the waveform latent space. Uses Conditional Flow Matching with a DiT backbone and WAV-VAE codec at 24kHz. Supports zero-shot voice cloning.
```python
from mlx_audio.tts.utils import load
model = load("mlx-community/LongCat-AudioDiT-1B-bf16")
# Zero-shot TTS
result = next(model.generate("Hello, this is a test of AudioDiT."))
audio = result.audio # mx.array, 24kHz
# Voice cloning (use "apg" guidance for best similarity)
result = next(model.generate(
    text="Today is warm turning to rain.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    guidance_method="apg",
    cfg_strength=4.0,
    steps=16,
))
```
See the [LongCat-AudioDiT README](mlx_audio/tts/models/longcat_audiodit/README.md) for all parameters and CLI usage.
### Voxtral TTS
Mistral's 4B multilingual text-to-speech with 20 voice presets across 9 languages.
```python
from mlx_audio.tts.utils import load
model = load("mlx-community/Voxtral-4B-TTS-2603-mlx-bf16")
for result in model.generate(text="Hello, how are you today?", voice="casual_male"):
    print(result.audio_duration)
```
Voices: `casual_male`, `casual_female`, `cheerful_female`, `neutral_male`, `neutral_female`, `fr_male`, `fr_female`, `es_male`, `es_female`, `de_male`, `de_female`, `it_male`, `it_female`, `pt_male`, `pt_female`, `nl_male`, `nl_female`, `ar_male`, `hi_male`, `hi_female`
### Voxtral Realtime
Mistral's 4B parameter streaming speech-to-text model, optimized for low-latency transcription.
Available variants: [4bit](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit) (smaller/faster) | [fp16](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16) (full precision)
```python
from mlx_audio.stt.utils import load
# Use 4bit for faster inference, fp16 for full precision
model = load("mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit")
# Transcribe audio
result = model.generate("audio.wav")
print(result.text)
# Streaming transcription
for chunk in model.generate("audio.wav", stream=True):
    print(chunk, end="", flush=True)
# Adjust transcription delay (lower = faster but less accurate)
result = model.generate("audio.wav", transcription_delay_ms=240)
```
### MedASR (Medical Transcription)
Specialized model for medical terms and dictation.
```python
from mlx_audio.stt.utils import load, transcribe
model = load("mlx-community/medasr")
result = transcribe("medical_dictation.wav", model=model)
print(result["text"])
```
**Live Transcription Example:**
```bash
# Continuous live transcription with VAD
python examples/medasr_live.py
```
### SAM-Audio (Source Separation)
Separate specific sounds from audio using text prompts:
```python
from mlx_audio.sts import SAMAudio, SAMAudioProcessor, save_audio
model = SAMAudio.from_pretrained("mlx-community/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("mlx-community/sam-audio-large")
batch = processor(
    descriptions=["A person speaking"],
    audios=["mixed_audio.wav"],
)
result = model.separate_long(
    batch.audios,
    descriptions=batch.descriptions,
    anchors=batch.anchor_ids,
    chunk_seconds=10.0,
    overlap_seconds=3.0,
    ode_opt={"method": "midpoint", "step_size": 2/32},
)
save_audio(result.target[0], "voice.wav")
save_audio(result.residual[0], "background.wav")
```
### MossFormer2 (Speech Enhancement)
Remove noise from speech recordings:
```python
from mlx_audio.sts import MossFormer2SEModel, save_audio
model = MossFormer2SEModel.from_pretrained("starkdmi/MossFormer2_SE_48K_MLX")
enhanced = model.enhance("noisy_speech.wav")
save_audio(enhanced, "clean.wav", 48000)
```
## Web Interface & API Server
MLX-Audio includes a modern web interface and an OpenAI-compatible REST API.
### Starting the Server
```bash
# Start API server
mlx_audio.server --host 0.0.0.0 --port 8000

# Start web UI (in another terminal)
cd mlx_audio/ui
npm install && npm run dev
```
### API Endpoints
**Text-to-Speech (OpenAI-compatible):**
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "mlx-community/Kokoro-82M-bf16", "input": "Hello!", "voice": "af_heart"}' \
--output speech.wav
```
**Speech-to-Text:**
```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=mlx-community/whisper-large-v3-turbo-asr-fp16"
```
## Quantization
Reduce model size and improve performance with quantization using the convert script:
```bash
# Convert and quantize to 4-bit
python -m mlx_audio.convert \
--hf-path prince-canuma/Kokoro-82M \
--mlx-path ./Kokoro-82M-4bit \
--quantize \
--q-bits 4 \
--upload-repo username/Kokoro-82M-4bit  # optional: upload the converted model to Hugging Face

# Convert with MXFP4 quantization
python -m mlx_audio.convert \
--hf-path prince-canuma/Kokoro-82M \
--mlx-path ./Kokoro-82M-mxfp4 \
--quantize \
--q-mode mxfp4

# Convert with specific dtype (bfloat16)
python -m mlx_audio.convert \
--hf-path prince-canuma/Kokoro-82M \
--mlx-path ./Kokoro-82M-bf16 \
--dtype bfloat16 \
--upload-repo username/Kokoro-82M-bf16  # optional: upload the converted model to Hugging Face
```
**Options:**

| Flag | Description |
|------|-------------|
| `--hf-path` | Source Hugging Face model or local path |
| `--mlx-path` | Output directory for the converted model |
| `-q, --quantize` | Enable quantization |
| `--q-bits` | Bits per weight (optional; defaults depend on `--q-mode`) |
| `--q-group-size` | Group size for quantization (optional; defaults depend on `--q-mode`) |
| `--q-mode` | Quantization mode: `affine`, `mxfp4`, `mxfp8`, `nvfp4` |
| `--dtype` | Weight dtype: `float16`, `bfloat16`, `float32` |
| `--upload-repo` | Upload the converted model to the HF Hub |
## Swift
Looking for Swift/iOS support? Check out [mlx-audio-swift](https://github.com/Blaizzy/mlx-audio-swift) for on-device TTS using MLX on macOS and iOS.
## Requirements
- Python 3.10+
- Apple Silicon Mac (M1/M2/M3/M4)
- MLX framework
- ffmpeg (required for MP3/FLAC/OGG/Opus/Vorbis audio encoding)
### Installing ffmpeg
ffmpeg is required for saving audio in MP3, FLAC, OGG, Opus, or Vorbis format. Install it using:
```bash
# macOS (using Homebrew)
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
```
WAV format works without ffmpeg.
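To confirm ffmpeg is visible to Python before generating non-WAV output, a quick check with the standard library:
```python
import shutil

# ffmpeg must be on PATH for MP3/FLAC/OGG/Opus/Vorbis encoding; WAV needs nothing extra.
if shutil.which("ffmpeg") is None:
    print("ffmpeg not found on PATH; only WAV output will work")
```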
## License
This project is released under the MIT License.
## Citation
```bibtex
@misc{mlx-audio,
  author = {Canuma, Prince},
  title = {MLX Audio},
  year = {2025},
  howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
  note = {Audio processing library for Apple Silicon with TTS, STT, and STS capabilities.}
}
```
## Acknowledgements
- Apple MLX Team for the MLX framework