If you're using OpenClaw to build AI agents that handle voice messages, you need a speech-to-text backend. Most setups send audio to cloud APIs like OpenAI's Whisper — but what if you want it fast, private, and free?

Mellon runs a local Whisper model on your Mac and exposes an OpenAI-compatible API. That means OpenClaw can use it as a drop-in replacement — no API keys, no cloud, no per-minute billing. Your audio never leaves your device.

Available in Mellon v1.4.0+.

Setup (2 minutes)

Step 1: Install Mellon

Install Mellon on your Mac and launch it. The Whisper model downloads automatically on first launch.

Step 2: Enable the API Server

Open Mellon's settings and go to API Server. Toggle the server on. It starts on http://localhost:8765 by default.

Verify it's running:

curl http://localhost:8765/health
# {"status": "ok", "model_loaded": true}

Step 3: Configure OpenClaw

Add this to your openclaw.json config file:

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "models": [{
          "provider": "openai",
          "model": "whisper-1",
          "baseUrl": "http://127.0.0.1:8765/v1"
        }]
      }
    }
  }
}

That's it. Restart OpenClaw and send a voice message — it'll be transcribed locally by Mellon.
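If you want to verify the drop-in compatibility outside OpenClaw, any OpenAI-style client works against the same endpoint. Below is a minimal stdlib-only Python sketch; the URL, `model` field, and `file` field come from the config above, while the hand-rolled multipart encoding is just for illustration (a real client library does this for you):

```python
import json
import urllib.request
import uuid

MELLON_URL = "http://localhost:8765/v1/audio/transcriptions"

def build_multipart(file_name: str, audio: bytes, model: str = "whisper-1"):
    """Encode the model and file fields as multipart/form-data,
    the shape an OpenAI-compatible client sends."""
    boundary = uuid.uuid4().hex
    parts = [
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="model"\r\n\r\n{model}\r\n'.encode(),
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="file"; filename="{file_name}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode()
        + audio + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    body = b"".join(parts)
    return body, f"multipart/form-data; boundary={boundary}"

def transcribe(path: str) -> str:
    """POST a local audio file to Mellon and return the corrected text."""
    with open(path, "rb") as f:
        body, content_type = build_multipart(path, f.read())
    req = urllib.request.Request(
        MELLON_URL, data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

With Mellon's API server running, `transcribe("recording.wav")` returns the fully corrected transcript, the same text OpenClaw receives.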

What You Get

Mellon doesn't just run Whisper. When OpenClaw sends audio to the /v1/audio/transcriptions endpoint, the audio goes through Mellon's full transcription pipeline:

  1. Whisper transcription — on-device, using Apple's Neural Engine for speed
  2. Custom dictionary matching — your custom terms (product names, people, jargon) are phonetically matched and corrected
  3. Medical dictionary — if enabled, medical terminology is automatically recognized
  4. Spellcheck corrections — common Whisper mistakes are caught and fixed

For example, if you've added "ChronoCat" and "Mellon" to your custom dictionary, Whisper's output of "I updated chrono cat and opened melon" gets automatically corrected to "I updated ChronoCat and opened Mellon".
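Mellon's actual matching algorithm isn't public, but the idea can be sketched with a deliberately crude phonetic key (lowercase, collapse doubled letters, keep a consonant skeleton) and a sliding window over the transcript. Everything here, including the `CUSTOM_TERMS` list, is illustrative:

```python
import re

# Hypothetical custom dictionary: canonical spellings of your terms.
CUSTOM_TERMS = ["ChronoCat", "Mellon"]

def phonetic_key(s: str) -> str:
    """Crude phonetic key: drop non-letters, collapse doubled
    letters ("mellon" -> "melon"), strip interior vowels. Mellon's
    real matcher is more sophisticated; this only shows the idea."""
    s = re.sub(r"[^a-z]", "", s.lower())
    s = re.sub(r"(.)\1+", r"\1", s)
    return s[:1] + re.sub(r"[aeiou]", "", s[1:])

def correct(text: str, terms=CUSTOM_TERMS) -> str:
    """Replace 1- or 2-word spans whose key matches a dictionary
    term with the term's canonical spelling."""
    keys = {phonetic_key(t): t for t in terms}
    words, out, i = text.split(), [], 0
    while i < len(words):
        for n in (2, 1):  # prefer two-word matches ("chrono cat")
            chunk = " ".join(words[i:i + n])
            stripped = chunk.rstrip(".,!?")  # keep trailing punctuation
            if phonetic_key(stripped) in keys:
                out.append(keys[phonetic_key(stripped)] + chunk[len(stripped):])
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

Here "chrono cat" and "melon" both key to the same skeletons as the dictionary entries, so `correct("I updated chrono cat and opened melon.")` yields the canonical spellings.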

All Available Endpoints

Mellon exposes several endpoints depending on your needs:

POST /v1/audio/transcriptions

Recommended. OpenAI-compatible (multipart/form-data). Full pipeline: Whisper + spellcheck + custom dictionary.

curl -X POST http://localhost:8765/v1/audio/transcriptions \
  -F "file=@recording.wav" \
  -F "model=whisper-1"

# {"text": "I updated ChronoCat and opened Mellon."}

POST /transcribe-full

Raw audio body. Same full pipeline, but returns detailed correction data and timing.

curl -X POST http://localhost:8765/transcribe-full \
  --data-binary @recording.wav \
  -H "Content-Type: application/octet-stream"

# {
#   "success": true,
#   "text": "I updated ChronoCat and opened Mellon.",
#   "whisper_text": "I updated chrono cat and opened melon.",
#   "corrections": [
#     {"original": "chrono cat", "corrected": "ChronoCat", "source": "custom"}
#   ],
#   "timing": {"whisper_ms": 1024, "spellcheck_ms": 2, "total_ms": 1026}
# }

POST /transcribe

Raw audio body. Whisper only — no spellcheck or dictionary corrections.

curl -X POST http://localhost:8765/transcribe \
  --data-binary @recording.wav \
  -H "Content-Type: application/octet-stream"

# {"success": true, "text": "I updated chrono cat.", "duration_ms": 1025}

GET /health

Returns server status and whether the Whisper model is loaded.

Supported Audio Formats

All endpoints accept: WAV, MP3, M4A, FLAC, AIFF, OGG (OGG requires macOS 14+). Audio is automatically converted — no pre-processing needed.
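Since the server converts audio itself, no client-side work is required; still, a cheap extension check can avoid a wasted round trip for obviously unsupported files. A small sketch based on the format list above (the OGG/macOS 14 gate is modeled as a parameter):

```python
from pathlib import Path

# Formats listed above; OGG additionally requires macOS 14+.
SUPPORTED = {".wav", ".mp3", ".m4a", ".flac", ".aiff", ".ogg"}

def is_supported(path: str, macos_major: int = 15) -> bool:
    """Client-side pre-check before uploading to Mellon.
    Purely optional: the server handles conversion itself."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED:
        return False
    if ext == ".ogg":
        return macos_major >= 14
    return True
```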

Custom Dictionary Tips

To get the most out of the API, add your commonly used terms in Mellon → Settings → Dictionary:

  • Product names — brand names, app names, project codenames
  • People's names — colleagues, contacts, team members
  • Technical jargon — industry-specific terms Whisper might misspell
  • Medical terms — enable the medical dictionary toggle for healthcare terminology

These corrections apply automatically to every API transcription; no per-request configuration is needed.

Privacy

The API server only listens on localhost. All processing happens on your Mac using Apple Silicon's Neural Engine. No audio data leaves your device, ever. No API keys, no accounts, no usage tracking.

Want us to set up an OpenClaw AI agent for your business? Book a free 30-min call — we handle the full setup and configuration.