Welcome back to the 67 AI Lab! We are on Day 4 of our 30-day journey to build the ultimate local AI agent.
Yesterday, we gave our agent deep research capabilities with Perplexity (assuming you followed along!). Today, we’re making it more human. We’re giving it a voice, ears, and an imagination.
Text interfaces are efficient, but talking to your room and having it reply—or asking it to visualize an idea instantly—is where the magic happens.
1. The Speaker: Text-to-Speech (TTS)
For a truly immersive experience, we need high-quality TTS. OpenClaw supports several providers, but ElevenLabs remains the gold standard for realistic voices.
Configuration
To enable TTS, you generally need to add your API key to your environment or TOOLS.md (depending on your specific OpenClaw setup).
# In your .env or export
export ELEVENLABS_API_KEY="your-key-here"
Once configured, your agent can speak using the tts tool. You can even assign different voices to different sub-agents or moods!
# Test it out
openclaw tts "I am now fully operational."
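If you want to see what the CLI is doing under the hood, here is a minimal sketch against the ElevenLabs REST API. The endpoint, `xi-api-key` header, and `eleven_multilingual_v2` model are standard ElevenLabs; the `speak` helper is our own name, and voice IDs come from your ElevenLabs dashboard.

```python
import os
import requests

TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def speak(text: str, voice_id: str) -> bytes:
    """Return MP3 audio bytes for `text`, spoken with the given ElevenLabs voice."""
    resp = requests.post(
        TTS_URL.format(voice_id=voice_id),
        # Assumes ELEVENLABS_API_KEY is exported as shown above
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
    )
    resp.raise_for_status()
    return resp.content  # raw MP3 bytes, ready to pipe to a player
```

Because the helper takes `voice_id` as a parameter, wiring different voices to different sub-agents is just a matter of passing a different ID per agent.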
2. The Listener: Speech-to-Text (STT)
Hearing is just as important as speaking. We use Whisper (via OpenAI or Deepgram) for fast, accurate transcription.
When running on a Raspberry Pi 5, you have two choices:
- Cloud STT: Sends audio to OpenAI/Deepgram. Fast, high quality, low CPU usage.
- Local STT: Run whisper.cpp locally. Private, free, but eats CPU cycles.
For this setup, we’re sticking with Cloud STT to keep our Pi responsive for other tasks.
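As a sketch of what the cloud path looks like in practice, here is a small helper hitting OpenAI's Whisper transcription endpoint (the endpoint and `whisper-1` model name are OpenAI's; the `transcribe` helper is our own):

```python
import os
import requests

STT_URL = "https://api.openai.com/v1/audio/transcriptions"

def transcribe(audio_path: str) -> str:
    """Upload an audio file to OpenAI Whisper and return the transcript text."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            STT_URL,
            # Assumes OPENAI_API_KEY is set in your environment
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            files={"file": f},
            data={"model": "whisper-1"},
        )
    resp.raise_for_status()
    return resp.json()["text"]
```

Swapping to Deepgram (or local whisper.cpp) later only means replacing the body of this one function.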
3. The Artist: Image Generation
Now for the fun part: giving your agent an imagination. We’ll use the HuggingFace Inference API to run Flux-Schnell, a lightning-fast image model.
The Skill
We’re using a custom Python skill script located in skills/huggingface/gen_image.py. This script sends a prompt to HuggingFace and saves the result.
Here is the core of how we invoke it:
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/black-forest-labs/FLUX.1-schnell"
# Assumes your HuggingFace token is exported as HF_TOKEN (adjust to your setup)
hf_token = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {hf_token}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()  # fail loudly on auth or rate-limit errors
    return response.content

image_bytes = query({
    "inputs": "A cyberpunk cat hacking a computer, 8k",
})

# The API returns raw image bytes; write them straight to disk
with open("cyberpunk_cat.png", "wb") as f:
    f.write(image_bytes)
Usage
With this skill enabled, you can simply ask your agent:
“Generate an image of a futuristic city on Mars.”
And it will execute the tool, saving the image to your static files directory. In fact, the cover image for this very post was generated by OpenClaw using this exact skill!
4. Putting It All Together
Imagine the workflow:
- You speak to your Pi: “Design a logo for a coffee shop on the moon.”
- STT transcribes your voice.
- The LLM processes the request and calls the gen_image tool.
- The image is generated.
- The LLM replies via TTS: “Here is a concept for the Lunar Latte logo.”
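In code, the loop above boils down to something like this. Every function here is a stand-in stub for the real tools wired up earlier in this post; the names and return values are ours, purely for illustration:

```python
def transcribe(audio: bytes) -> str:
    """STT stub: real version sends audio to Whisper/Deepgram."""
    return "Design a logo for a coffee shop on the moon."

def run_llm(prompt: str) -> dict:
    """LLM stub: real version returns a tool call plus a spoken reply."""
    return {"tool": "gen_image", "args": {"prompt": prompt},
            "reply": "Here is a concept for the Lunar Latte logo."}

def gen_image(prompt: str) -> str:
    """Image-skill stub: real version calls the HuggingFace skill above."""
    return "/static/lunar_latte.png"

def speak(text: str) -> None:
    """TTS stub: real version synthesizes audio via ElevenLabs."""
    print(f"[TTS] {text}")

def handle_voice_request(audio: bytes) -> str:
    prompt = transcribe(audio)            # 1. ears
    action = run_llm(prompt)              # 2. brain
    path = gen_image(**action["args"])    # 3. imagination
    speak(action["reply"])                # 4. voice
    return path

print(handle_voice_request(b""))  # → prints the TTS line, then /static/lunar_latte.png
```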
This is the power of multimodal agents. We aren’t just processing text anymore; we are interacting with media.
What’s Next?
Tomorrow, on Day 5, we’re cutting the cord. We’ll explore Going Private by running local LLMs with Ollama directly on the Raspberry Pi 5. No more API bills!
See you then.