
# MLX-Audio

An audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.

## Features

- Fast inference optimized for Apple Silicon (M series chips)
- Multiple model architectures for TTS, STT, and STS
- Multilingual support across models
- Voice customization and cloning capabilities
- Adjustable speech speed control
- Interactive web interface with 3D audio visualization
- OpenAI-compatible REST API
- Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
- Swift package for iOS/macOS integration

## Installation

### Using pip

```bash
pip install mlx-audio
```

### Using uv to install only the command-line tools

Latest release from PyPI:

```bash
uv tool install --force mlx-audio --prerelease=allow
```

Latest code from GitHub:

```bash
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
```

For development or the web interface:

```bash
git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev]"
```

## Quick Start

### Command Line

```bash
# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello, world!" --lang_code a

# With voice selection and speed adjustment
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --voice af_heart --speed 1.2 --lang_code a

# Play audio immediately
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --play --lang_code a

# Save to a specific directory
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --output_path ./my_audio --lang_code a
```

### Python API

```python
from mlx_audio.tts.utils import load_model

# Load model
model = load_model("mlx-community/Kokoro-82M-bf16")

# Generate speech
for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array
```
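To write the generated waveform to disk, one option is the `soundfile` package. This is a sketch, not part of the MLX-Audio API: `soundfile` is an assumed extra dependency, and the 24 kHz sample rate assumes Kokoro's output rate.

```python
import numpy as np
import soundfile as sf

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Kokoro-82M-bf16")

# Collect all generated segments and concatenate them into one waveform
segments = [np.array(r.audio) for r in model.generate("Hello!", voice="af_heart")]

# 24000 Hz assumes Kokoro's output rate; check your model's sample rate
sf.write("hello.wav", np.concatenate(segments), samplerate=24000)
```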

## Supported Models

### Text-to-Speech (TTS)

| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| Kokoro | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | mlx-community/Kokoro-82M-bf16 |
| Qwen3-TTS | Alibaba's multilingual TTS with voice design | ZH, EN, JA, KO, + more | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 |
| CSM | Conversational Speech Model with voice cloning | EN | mlx-community/csm-1b |
| Dia | Dialogue-focused TTS | EN | mlx-community/Dia-1.6B-bf16 |
| OuteTTS | Efficient TTS model | EN | mlx-community/OuteTTS-0.2-500M |
| Spark | SparkTTS model | EN, ZH | mlx-community/SparkTTS-0.5B-bf16 |
| Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | mlx-community/Chatterbox-bf16 |
| Soprano | High-quality TTS | EN | mlx-community/Soprano-bf16 |

### Speech-to-Text (STT)

| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| Whisper | OpenAI's robust STT model | 99+ languages | mlx-community/whisper-large-v3-turbo-asr-fp16 |
| Parakeet | NVIDIA's accurate STT | EN | mlx-community/parakeet-tdt-0.6b-v2 |
| Voxtral | Mistral's speech model | Multiple | mlx-community/Voxtral-Mini-3B-2507-bf16 |
| VibeVoice-ASR | Microsoft's 9B ASR with diarization & timestamps | Multiple | mlx-community/VibeVoice-ASR-bf16 |

### Speech-to-Speech (STS)

| Model | Description | Use Case | Repo |
|-------|-------------|----------|------|
| SAM-Audio | Text-guided source separation | Extract specific sounds | mlx-community/sam-audio-large |
| Liquid2.5-Audio* | Speech-to-Speech, Text-to-Speech and Speech-to-Text | Speech interactions | mlx-community/LFM2.5-Audio-1.5B-8bit |
| MossFormer2 SE | Speech enhancement | Noise removal | starkdmi/MossFormer2_SE_48K_MLX |

## Model Examples

### Kokoro TTS

Kokoro is a fast, multilingual TTS model with 54 voice presets.

```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Kokoro-82M-bf16")

# Generate with different voices
for result in model.generate(
    text="Welcome to MLX-Audio!",
    voice="af_heart",  # American female
    speed=1.0,
    lang_code="a",     # American English
):
    audio = result.audio
```

**Available Voices:**

- American English: af_heart, af_bella, af_nova, af_sky, am_adam, am_echo, etc.
- British English: bf_alice, bf_emma, bm_daniel, bm_george, etc.
- Japanese: jf_alpha, jm_kumo, etc.
- Chinese: zf_xiaobei, zm_yunxi, etc.

**Language Codes:**

| Code | Language | Note |
|------|----------|------|
| a | American English | Default |
| b | British English | |
| j | Japanese | Requires `pip install misaki[ja]` |
| z | Mandarin Chinese | Requires `pip install misaki[zh]` |
| e | Spanish | |
| f | French | |
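Pairing a voice preset with its matching language code looks like this (a sketch combining the documented parameters above; assumes `misaki[ja]` is installed and that the `jf_alpha` preset pairs with lang code `j`):

```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Kokoro-82M-bf16")

# Japanese synthesis: the voice preset and lang_code must agree
for result in model.generate(
    text="こんにちは、世界！",
    voice="jf_alpha",  # Japanese female preset
    lang_code="j",     # requires: pip install misaki[ja]
):
    audio = result.audio
```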

### Qwen3-TTS

Alibaba's state-of-the-art multilingual TTS with voice cloning, emotion control, and voice design capabilities.

```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    language="English",
))

audio = results[0].audio  # mx.array
```

See the Qwen3-TTS README for voice cloning, CustomVoice, VoiceDesign, and all available models.

### CSM (Voice Cloning)

Clone any voice using a reference audio sample:

```bash
mlx_audio.tts.generate \
  --model mlx-community/csm-1b \
  --text "Hello from Sesame." \
  --ref_audio ./reference_voice.wav \
  --play
```
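The same flow from Python might look like the following sketch. The `ref_audio` keyword is an assumption that mirrors the CLI's `--ref_audio` flag, not a documented signature:

```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/csm-1b")

# ref_audio mirrors the CLI's --ref_audio flag (assumed keyword)
for result in model.generate(
    text="Hello from Sesame.",
    ref_audio="./reference_voice.wav",
):
    audio = result.audio
```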

### Whisper STT

```python
from mlx_audio.stt.utils import load_model, transcribe

model = load_model("mlx-community/whisper-large-v3-turbo-asr-fp16")
result = transcribe("audio.wav", model=model)
print(result["text"])
```

### VibeVoice-ASR

Microsoft's 9B parameter speech-to-text model with speaker diarization and timestamps. Supports long-form audio (up to 60 minutes) and outputs structured JSON.

```python
from mlx_audio.stt.utils import load

model = load("mlx-community/VibeVoice-ASR-bf16")

# Basic transcription
result = model.generate(audio="meeting.wav", max_tokens=8192, temperature=0.0)
print(result.text)
# [{"Start":0,"End":5.2,"Speaker":0,"Content":"Hello everyone, let's begin."},
#  {"Start":5.5,"End":9.8,"Speaker":1,"Content":"Thanks for joining today."}]

# Access parsed segments
for seg in result.segments:
    print(f"[{seg['start_time']:.1f}-{seg['end_time']:.1f}] Speaker {seg['speaker_id']}: {seg['text']}")
```

Streaming transcription:

```python
# Stream tokens as they are generated
for text in model.stream_transcribe(audio="speech.wav", max_tokens=4096):
    print(text, end="", flush=True)
```

With context (hotwords/metadata):

```python
result = model.generate(
    audio="technical_talk.wav",
    context="MLX, Apple Silicon, PyTorch, Transformer",
    max_tokens=8192,
    temperature=0.0,
)
```

CLI usage:

```bash
# Basic transcription
python -m mlx_audio.stt.generate \
  --model mlx-community/VibeVoice-ASR-bf16 \
  --audio meeting.wav \
  --output-path output \
  --format json \
  --max-tokens 8192 \
  --verbose

# With context/hotwords
python -m mlx_audio.stt.generate \
  --model mlx-community/VibeVoice-ASR-bf16 \
  --audio technical_talk.wav \
  --output-path output \
  --format json \
  --max-tokens 8192 \
  --context "MLX, Apple Silicon, PyTorch, Transformer" \
  --verbose
```

### SAM-Audio (Source Separation)

Separate specific sounds from audio using text prompts:

```python
from mlx_audio.sts import SAMAudio, SAMAudioProcessor, save_audio

model = SAMAudio.from_pretrained("mlx-community/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("mlx-community/sam-audio-large")

batch = processor(
    descriptions=["A person speaking"],
    audios=["mixed_audio.wav"],
)

result = model.separate_long(
    batch.audios,
    descriptions=batch.descriptions,
    anchors=batch.anchor_ids,
    chunk_seconds=10.0,
    overlap_seconds=3.0,
    ode_opt={"method": "midpoint", "step_size": 2 / 32},
)

save_audio(result.target[0], "voice.wav")
save_audio(result.residual[0], "background.wav")
```

### MossFormer2 (Speech Enhancement)

Remove noise from speech recordings:

```python
from mlx_audio.sts import MossFormer2SEModel, save_audio

model = MossFormer2SEModel.from_pretrained("starkdmi/MossFormer2_SE_48K_MLX")
enhanced = model.enhance("noisy_speech.wav")
save_audio(enhanced, "clean.wav", 48000)
```

## Web Interface & API Server

MLX-Audio includes a modern web interface and an OpenAI-compatible API.

### Starting the Server

```bash
# Start API server
mlx_audio.server --host 0.0.0.0 --port 8000

# Start web UI (in another terminal)
cd mlx_audio/ui
npm install && npm run dev
```

### API Endpoints

Text-to-Speech (OpenAI-compatible):

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Kokoro-82M-bf16", "input": "Hello!", "voice": "af_heart"}' \
  --output speech.wav
```

Speech-to-Text:

```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=mlx-community/whisper-large-v3-turbo-asr-fp16"
```

Looking for Swift/iOS support? Check out mlx-audio-swift for on-device TTS using MLX on macOS and iOS.

## Quantization

Reduce model size and improve performance with quantization using the convert script:

```bash
# Convert and quantize to 4-bit
python -m mlx_audio.convert \
  --hf-path prince-canuma/Kokoro-82M \
  --mlx-path ./Kokoro-82M-4bit \
  --quantize \
  --q-bits 4 \
  --upload-repo username/Kokoro-82M-4bit  # optional: upload the model to Hugging Face

# Convert with a specific dtype (bfloat16)
python -m mlx_audio.convert \
  --hf-path prince-canuma/Kokoro-82M \
  --mlx-path ./Kokoro-82M-bf16 \
  --dtype bfloat16 \
  --upload-repo username/Kokoro-82M-bf16  # optional: upload the model to Hugging Face
```

Options:

| Flag | Description |
|------|-------------|
| `--hf-path` | Source Hugging Face model or local path |
| `--mlx-path` | Output directory for the converted model |
| `-q`, `--quantize` | Enable quantization |
| `--q-bits` | Bits per weight (4, 6, or 8) |
| `--q-group-size` | Group size for quantization (default: 64) |
| `--dtype` | Weight dtype: `float16`, `bfloat16`, `float32` |
| `--upload-repo` | Upload the converted model to the HF Hub |
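The converted directory can then be used like a hub repo. This is a sketch: passing a local path to `load_model` is an assumption based on the `--mlx-path` output above.

```python
from mlx_audio.tts.utils import load_model

# Load the locally converted 4-bit model instead of a hub repo (assumed to work)
model = load_model("./Kokoro-82M-4bit")

for result in model.generate("Testing the quantized model.", voice="af_heart"):
    audio = result.audio
```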

## Requirements

- Python 3.10+
- Apple Silicon Mac (M1/M2/M3/M4)
- MLX framework
- ffmpeg (required for MP3/FLAC audio encoding)
- For the web interface and API: FastAPI and Uvicorn

### Installing ffmpeg

    ffmpeg is required for saving audio in MP3 or FLAC format. Install it using:

```bash
# macOS (using Homebrew)
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
```

    WAV format works without ffmpeg.

## License

    MIT License

## Citation

```bibtex
@misc{mlx-audio,
  author = {Canuma, Prince},
  title = {MLX Audio},
  year = {2025},
  howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
  note = {Audio processing library for Apple Silicon with TTS, STT, and STS capabilities.}
}
```

## Acknowledgements

- Apple MLX Team for the MLX framework

