Welcome back to the 67 AI Lab! We are on Day 4 of our 30-day journey to build the ultimate local AI agent.
Yesterday, we gave our agent deep research capabilities with Perplexity (assuming you followed along!). Today, we’re making it more human. We’re giving it a voice, ears, and an imagination.
Text interfaces are efficient, but talking to your room and having it reply—or asking it to visualize an idea instantly—is where the magic happens.
1. The Speaker: Text-to-Speech (TTS)
For a truly immersive experience, we need high-quality TTS. OpenClaw supports several providers, but ElevenLabs remains the gold standard for realistic voices.
Configuration
To enable TTS, you generally need to add your API key to your environment or TOOLS.md (depending on your specific OpenClaw setup).
# In your .env or export
export ELEVENLABS_API_KEY="your-key-here"
Once configured, your agent can speak using the tts tool. You can even assign different voices to different sub-agents or moods!
# Test it out
openclaw tts "I am now fully operational."
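If you want to see what the CLI is doing under the hood, here is a minimal sketch against the ElevenLabs REST API. The endpoint, `xi-api-key` header, and `eleven_multilingual_v2` model are standard ElevenLabs; the `speak` helper is our own name, and voice IDs come from your ElevenLabs dashboard.

```python
import os
import requests

TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def speak(text: str, voice_id: str) -> bytes:
    """Return MP3 audio bytes for `text`, spoken with the given ElevenLabs voice."""
    resp = requests.post(
        TTS_URL.format(voice_id=voice_id),
        # Assumes ELEVENLABS_API_KEY is exported as shown above
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
    )
    resp.raise_for_status()
    return resp.content  # raw MP3 bytes, ready to pipe to a player
```

Because the helper takes `voice_id` as a parameter, wiring different voices to different sub-agents is just a matter of passing a different ID per agent.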
2. The Listener: Speech-to-Text (STT)
Hearing is just as important as speaking. We use Whisper (via OpenAI or Deepgram) for fast, accurate transcription.
When running on a Raspberry Pi 5, you have two choices:
- Cloud STT: Sends audio to OpenAI/Deepgram. Fast, high quality, low CPU usage.
- Local STT: Run whisper.cpp locally. Private, free, but eats CPU cycles.
For this setup, we’re sticking with Cloud STT to keep our Pi responsive for other tasks.
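As a sketch of what the cloud path looks like in practice, here is a small helper hitting OpenAI's Whisper transcription endpoint (the endpoint and `whisper-1` model name are OpenAI's; the `transcribe` helper is our own):

```python
import os
import requests

STT_URL = "https://api.openai.com/v1/audio/transcriptions"

def transcribe(audio_path: str) -> str:
    """Upload an audio file to OpenAI Whisper and return the transcript text."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            STT_URL,
            # Assumes OPENAI_API_KEY is set in your environment
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            files={"file": f},
            data={"model": "whisper-1"},
        )
    resp.raise_for_status()
    return resp.json()["text"]
```

Swapping to Deepgram (or local whisper.cpp) later only means replacing the body of this one function.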
3. The Artist: Image Generation
Now for the fun part: giving your agent an imagination. We’ll use the HuggingFace Inference API to run Flux-Schnell, a lightning-fast image model.
The Skill
We’re using a custom Python skill script located in skills/huggingface/gen_image.py. This script sends a prompt to HuggingFace and saves the result.
Here is the core of how we invoke it:
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/black-forest-labs/FLUX.1-schnell"
# Assumes your HuggingFace token is exported as HF_TOKEN (adjust to your setup)
hf_token = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {hf_token}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()  # fail loudly on auth or rate-limit errors
    return response.content

image_bytes = query({
    "inputs": "A cyberpunk cat hacking a computer, 8k",
})

# The API returns raw image bytes; write them straight to disk
with open("cyberpunk_cat.png", "wb") as f:
    f.write(image_bytes)
Usage
With this skill enabled, you can simply ask your agent:
“Generate an image of a futuristic city on Mars.”
And it will execute the tool, saving the image to your static files directory. In fact, the cover image for this very post was generated by OpenClaw using this exact skill!
4. Putting It All Together
Imagine the workflow:
- You speak to your Pi: “Design a logo for a coffee shop on the moon.”
- STT transcribes your voice.
- The LLM processes the request and calls the gen_image tool.
- The image is generated.
- The LLM replies via TTS: “Here is a concept for the Lunar Latte logo.”
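In code, the loop above boils down to something like this. Every function here is a stand-in stub for the real tools wired up earlier in this post; the names and return values are ours, purely for illustration:

```python
def transcribe(audio: bytes) -> str:
    """STT stub: real version sends audio to Whisper/Deepgram."""
    return "Design a logo for a coffee shop on the moon."

def run_llm(prompt: str) -> dict:
    """LLM stub: real version returns a tool call plus a spoken reply."""
    return {"tool": "gen_image", "args": {"prompt": prompt},
            "reply": "Here is a concept for the Lunar Latte logo."}

def gen_image(prompt: str) -> str:
    """Image-skill stub: real version calls the HuggingFace skill above."""
    return "/static/lunar_latte.png"

def speak(text: str) -> None:
    """TTS stub: real version synthesizes audio via ElevenLabs."""
    print(f"[TTS] {text}")

def handle_voice_request(audio: bytes) -> str:
    prompt = transcribe(audio)            # 1. ears
    action = run_llm(prompt)              # 2. brain
    path = gen_image(**action["args"])    # 3. imagination
    speak(action["reply"])                # 4. voice
    return path

print(handle_voice_request(b""))  # → prints the TTS line, then /static/lunar_latte.png
```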
This is the power of multimodal agents. We aren’t just processing text anymore; we are interacting with media.
What’s Next?
Tomorrow, on Day 5, we’re cutting the cord. We’ll explore Going Private by running local LLMs with Ollama directly on the Raspberry Pi 5. No more API bills!
See you then.