Audio/Video Processing with Speech-to-Text, Text-to-Speech, and Multimedia RAG - Transform voice into text, text into voice, and index multimedia content for AI-powered search
v4.0.0
Architecture Overview
Media Worker Processing Pipeline
Voice Chat Pipeline
Complete voice-in, voice-out conversation flow with your AI models.
1. Audio Input (microphone or audio file) → 2. Transcribe (Whisper STT) → 3. Process (LLM inference) → 4. Synthesize (Piper/ElevenLabs) → 5. Audio Out (speaker output)
# Voice chat with a single API call
curl -X POST http://media-worker:8894/api/v1/voice/chat \
-H 'Content-Type: application/json' \
-d '{
"audio_data": "'$(base64 -i recording.wav)'",
"model": "llama3.2:3b",
"voice_id": "en_US-amy-medium",
"response_format": "mp3"
}'
# Response includes both text and audio
{
"text_response": "I can help you with that...",
"audio_data": "<base64-encoded-mp3>",
"transcription": "What's the weather like?",
"duration_ms": 2340
}
Core Capabilities
Speech-to-Text (Free+)
Convert spoken audio into accurate text transcriptions.
Whisper.cpp local GPU acceleration
OpenAI Whisper API fallback
99+ languages supported
Real-time streaming transcription
Speaker diarization (Pro+)
Text-to-Speech (Free+)
Generate natural-sounding speech from text.
Piper local TTS (100+ voices)
ElevenLabs HD voices
OpenAI TTS integration
SSML support for prosody control
Voice cloning (Enterprise)
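Since SSML prosody control is listed above, here is a minimal payload sketch. Whether the synthesize endpoint takes SSML through the regular "text" field or a dedicated field is an assumption, as is the exact set of supported tags:

```shell
# SSML sketch for prosody control: emphasis plus a 300 ms pause.
# Whether this goes in the "text" field of /api/v1/tts/synthesize or a
# dedicated "ssml" field is an assumption about the API.
SSML='<speak>Welcome to <emphasis level="strong">Eldric</emphasis>.<break time="300ms"/> How can I help?</speak>'
printf '%s\n' "$SSML"
```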
Video Processing (Standard+)
Extract and process video content for AI workflows.
FFmpeg transcoding
Frame extraction at intervals
Scene detection
Audio track extraction
Video transcription pipeline
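The frame-extraction endpoint from the REST API table can drive workflows like thumbnail generation or vision-model input. The interval and format fields below are assumed names, not documented parameters:

```shell
# Request sketch for /api/v1/video/extract-frames. "interval_seconds"
# and "image_format" are assumed field names; adjust to the real schema.
PAYLOAD='{"video_url": "/recordings/demo.mp4", "interval_seconds": 5, "image_format": "jpeg"}'
printf '%s\n' "$PAYLOAD"
# curl -X POST http://media-worker:8894/api/v1/video/extract-frames \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```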
Multimedia RAG (Standard+)
Index and search audio/video content semantically.
Automatic transcription indexing
Timestamp-aligned search
Multi-modal embeddings
Integration with Data Worker
Citation with timestamps
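The ingestion endpoints listed in the REST API table take the media to index; the "collection" and "metadata" fields below are assumptions about the payload shape, not documented parameters:

```shell
# Request sketch for /api/v1/rag/ingest-audio. "collection" and
# "metadata" are assumed field names, not documented parameters.
PAYLOAD='{"audio_url": "/recordings/standup.mp3", "collection": "meetings", "metadata": {"team": "platform"}}'
printf '%s\n' "$PAYLOAD"
# curl -X POST http://media-worker:8894/api/v1/rag/ingest-audio \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```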
STT Backends
Whisper.cpp (Local): GPU-accelerated transcription with GGML models
OpenAI Whisper (Cloud): High-accuracy transcription via the OpenAI API
Faster-Whisper (Local): CTranslate2-optimized inference
Azure Speech (Cloud): Microsoft Azure Cognitive Services
# Transcribe audio with Whisper.cpp
curl -X POST http://media-worker:8894/api/v1/stt/transcribe \
-H 'Content-Type: application/json' \
-d '{
"audio_url": "https://example.com/meeting.mp3",
"language": "en",
"model": "large-v3",
"options": {
"word_timestamps": true,
"diarize": true
}
}'
# Response with word-level timestamps
{
"text": "Hello, welcome to the meeting...",
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Hello, welcome to the meeting",
"speaker": "SPEAKER_00"
}
],
"language": "en",
"duration": 3600.5
}
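Segment timestamps like those above map directly onto subtitle formats. A minimal sketch of the seconds-to-SRT conversion, applied to the first segment of the sample response:

```shell
# Convert a segment start/end time in seconds to an SRT timestamp
# (HH:MM:SS,mmm). Pure awk, so it works directly on the JSON values above.
srt_ts() {
  awk -v t="$1" 'BEGIN { h = int(t/3600); m = int((t%3600)/60); s = t - h*3600 - m*60;
    printf "%02d:%02d:%06.3f", h, m, s }' | tr '.' ','
}
# First segment from the sample response: start 0.0, end 2.5
echo "1"
echo "$(srt_ts 0.0) --> $(srt_ts 2.5)"
echo "Hello, welcome to the meeting"
```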
TTS Backends
Piper (Local): Fast neural TTS with 100+ voices
ElevenLabs (Cloud): Premium HD voices with emotion control
OpenAI TTS (Cloud): Natural-sounding voices via API
Coqui TTS (Local): Open-source multilingual TTS
# Synthesize speech with Piper
curl -X POST http://media-worker:8894/api/v1/tts/synthesize \
-H 'Content-Type: application/json' \
-d '{
"text": "Welcome to Eldric, your AI-powered assistant.",
"voice_id": "en_US-amy-medium",
"output_format": "mp3",
"speed": 1.0
}' --output speech.mp3
# List available voices
curl http://media-worker:8894/api/v1/tts/voices
# Stream audio in real-time (SSE)
curl -N http://media-worker:8894/api/v1/tts/stream \
-H 'Content-Type: application/json' \
-d '{"text": "This is streamed audio...", "voice_id": "en_GB-alan-medium"}'
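One way to consume the SSE stream, assuming each `data:` line carries a single base64-encoded audio chunk (the actual event format is an assumption):

```shell
# Decode an SSE audio stream to raw bytes, assuming each "data:" line
# holds one base64 chunk. A real call pipes `curl -N ...` into decode_sse.
decode_sse() {
  sed -n 's/^data: //p' | while read -r chunk; do
    printf '%s' "$chunk" | base64 -d
  done
}
# Simulated two-chunk stream standing in for the curl -N output:
printf 'data: %s\n\ndata: %s\n\n' "$(printf 'AB' | base64)" "$(printf 'CD' | base64)" | decode_sse
echo
```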
REST API
Method  Endpoint                        Description
GET     /health                         Health check with backend status
GET     /dashboard                      Web dashboard for monitoring
GET     /api/v1/media/info              Worker capabilities and models
POST    /api/v1/stt/transcribe          Transcribe audio file or URL
POST    /api/v1/stt/stream              Streaming transcription (SSE)
GET     /api/v1/stt/models              List available STT models
POST    /api/v1/tts/synthesize          Generate speech from text
POST    /api/v1/tts/stream              Stream audio output (SSE)
GET     /api/v1/tts/voices              List available TTS voices
POST    /api/v1/video/transcribe        Extract and transcribe video audio
POST    /api/v1/video/extract-frames    Extract frames at intervals
POST    /api/v1/voice/chat              Voice-in, voice-out conversation
POST    /api/v1/rag/ingest-audio        Index audio for RAG search
POST    /api/v1/rag/ingest-video        Index video for RAG search
Use Cases
1. Meeting Transcription & Summarization
Automatically transcribe meetings, generate summaries, and make content searchable.
1. Upload recording → 2. STT transcription → 3. Speaker diarization → 4. LLM summary → 5. RAG index
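The LLM-summary step above amounts to wrapping the transcript in a chat-completion request. A sketch using jq to keep the quoting safe (the prompt wording is illustrative):

```shell
# Build the chat-completions body for the summary step; jq handles
# escaping of quotes and newlines inside the transcript text.
summary_request() {
  jq -n --arg t "$1" \
    '{model: "llama3.2:3b",
      messages: [{role: "user", content: ("Summarize this meeting transcript:\n\n" + $t)}]}'
}
summary_request "Hello, welcome to the meeting"
# Then: curl -X POST http://inference:8890/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$(summary_request "$TRANSCRIPT")"
```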
2. Voice-Enabled AI Assistant
Build hands-free AI interactions with natural voice input and output.
3. Video Knowledge Base
Index your video library for AI-powered search and Q&A with timestamp citations.
1. Ingest videos → 2. Extract audio → 3. Transcribe → 4. Vector embed → 5. Semantic search
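Ingesting a whole library is just the ingest call in a loop. A sketch that builds one request per file (the "collection" field is an assumed parameter):

```shell
# One ingest payload per video. In a real run, replace the sample list
# with e.g. /videos/*.mp4 and POST each payload as shown in the comment.
for f in intro.mp4 keynote.mp4 q-and-a.mp4; do
  payload=$(printf '{"video_url": "/videos/%s", "collection": "library"}' "$f")
  echo "$payload"
  # curl -X POST http://media-worker:8894/api/v1/rag/ingest-video \
  #   -H 'Content-Type: application/json' -d "$payload"
done
```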
4. Podcast Production Pipeline
Automate podcast post-production with transcription, show notes, and audio enhancement.
# Full podcast processing pipeline
curl -X POST http://media-worker:8894/api/v1/video/transcribe \
-d '{"video_url": "/recordings/episode-42.mp4"}'
# Generate show notes with LLM
curl -X POST http://inference:8890/v1/chat/completions \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Generate show notes: ...transcript..."}]
}'
# Create audio clips for social media
curl -X POST http://media-worker:8894/api/v1/tts/synthesize \
-d '{"text": "Key quote from the episode...", "voice_id": "en_US-amy-medium"}'
Quick Start
1. Start Media Worker
# Run with local Whisper and Piper
./eldric-mediad --port 8894 \
--stt-backend whisper_cpp \
--whisper-model /models/ggml-large-v3.bin \
--whisper-gpu \
--tts-backend piper \
--piper-models /models/piper
# Register with controller
./eldric-mediad --controller http://controller:8880 \
--data-workers http://data-worker:8892
2. Test Transcription
# Health check
curl http://localhost:8894/health
# Transcribe a file
curl -X POST http://localhost:8894/api/v1/stt/transcribe \
-H 'Content-Type: application/json' \
-d '{"audio_url": "/path/to/audio.mp3"}'
# List voices
curl http://localhost:8894/api/v1/tts/voices
3. Voice Chat Example
# Record audio and send to voice chat endpoint
AUDIO_B64=$(base64 -i question.wav)
curl -X POST http://localhost:8894/api/v1/voice/chat \
-H 'Content-Type: application/json' \
-d "{
\"audio_data\": \"$AUDIO_B64\",
\"model\": \"llama3.2:3b\",
\"voice_id\": \"en_US-amy-medium\"
}" | jq -r '.audio_data' | base64 -d > response.mp3
# Play the response (afplay is macOS; use mpg123 or ffplay on Linux)
afplay response.mp3