Eldric Media Worker

Audio/Video Processing with Speech-to-Text, Text-to-Speech, and Multimedia RAG - Transform voice into text, text into voice, and index multimedia content for AI-powered search

v4.0.0

Architecture Overview

Media Worker Processing Pipeline
[Architecture diagram: the Media Worker (eldric-mediad, port 8894) is the audio/video processing hub. Audio inputs (.mp3, .wav, .m4a) and video inputs (.mp4, .webm, .mkv) feed three subsystems: Speech-to-Text (Whisper.cpp on local GPU, OpenAI Whisper cloud API, real-time streaming via WebSocket/SSE), Text-to-Speech (Piper local voices, ElevenLabs cloud API, OpenAI TTS HD synthesis), and Video Processing (FFmpeg transcoding, frame extraction, scene detection). A voice chat pipeline runs Audio In → STT → LLM → TTS → Audio Out, and Multimedia RAG indexes audio/video content for semantic search via the Data Worker on port 8892.]

Voice Chat Pipeline

Complete voice-in, voice-out conversation flow with your AI models.

  1. Audio Input: microphone or audio file
  2. Transcribe: Whisper STT
  3. Process: LLM inference
  4. Synthesize: Piper/ElevenLabs
  5. Audio Out: speaker output

# Voice chat with a single API call
curl -X POST http://media-worker:8894/api/v1/voice/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "audio_data": "'$(base64 -i recording.wav)'",
    "model": "llama3.2:3b",
    "voice_id": "en_US-amy-medium",
    "response_format": "mp3"
  }'

# Response includes both text and audio
{
  "text_response": "I can help you with that...",
  "audio_data": "<base64-encoded-mp3>",
  "transcription": "What's the weather like?",
  "duration_ms": 2340
}
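The audio comes back base64-encoded, so the last step is decoding it to a playable file. A minimal sketch, using a canned response (so it runs offline) and python3 for the JSON parsing; the field names mirror the response shape above:

```shell
# Decode the audio_data field from a saved voice-chat response.
# "aGVsbG8=" is base64 for "hello", standing in for real MP3 bytes.
cat > response.json <<'EOF'
{"text_response": "ok", "audio_data": "aGVsbG8=", "transcription": "hi", "duration_ms": 10}
EOF
python3 -c 'import json; print(json.load(open("response.json"))["audio_data"])' \
  | base64 -d > reply.bin
cat reply.bin    # -> hello
```

Against a live worker, the same decode step applies to the real `audio_data` value.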

Core Capabilities

Speech-to-Text

Free+

Convert spoken audio into accurate text transcriptions.

  • Whisper.cpp local GPU acceleration
  • OpenAI Whisper API fallback
  • 99+ languages supported
  • Real-time streaming transcription
  • Speaker diarization (Pro+)
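The streaming endpoints are served over SSE. The exact event payload below (`{"text": ...}`) is an assumption, not taken from the API spec, but the `data:` framing itself is standard and can be split with ordinary tools:

```shell
# Parse "data:" lines from a simulated SSE transcription stream.
cat > stream.txt <<'EOF'
data: {"text": "Hello,"}

data: {"text": " welcome to the meeting"}

EOF
grep '^data: ' stream.txt | sed 's/^data: //'    # -> one JSON object per event
```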

Text-to-Speech

Free+

Generate natural-sounding speech from text.

  • Piper local voices (100+ voices)
  • ElevenLabs HD voices
  • OpenAI TTS integration
  • SSML support for prosody control
  • Voice cloning (Enterprise)
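Since SSML is listed for prosody control, a minimal SSML fragment might look like the sketch below. Which elements are honored is backend-dependent; local engines such as Piper typically support only a subset, while cloud backends accept richer markup:

```xml
<!-- Hypothetical SSML input for TTS synthesis; element support varies by backend. -->
<speak>
  <p>
    Welcome to <emphasis level="strong">Eldric</emphasis>.
    <break time="300ms"/>
    <prosody rate="slow" pitch="+2st">This sentence is spoken slowly.</prosody>
  </p>
</speak>
```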

Video Processing

Standard+

Extract and process video content for AI workflows.

  • FFmpeg transcoding
  • Frame extraction at intervals
  • Scene detection
  • Audio track extraction
  • Video transcription pipeline
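The extraction steps above map onto standard FFmpeg invocations. A sketch with hypothetical paths, written to a script rather than executed here (it needs FFmpeg and a real input file):

```shell
# Sketch of typical FFmpeg commands for frame and audio-track extraction (not run here).
cat > video-prep.sh <<'EOF'
#!/bin/sh
# One frame every 10 seconds, numbered sequentially
ffmpeg -i input.mp4 -vf fps=1/10 frame_%04d.jpg
# Strip the audio track to 16 kHz mono WAV, the format Whisper expects
ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 audio.wav
EOF
grep -c '^ffmpeg' video-prep.sh    # -> 2
```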

Multimedia RAG

Standard+

Index and search audio/video content semantically.

  • Automatic transcription indexing
  • Timestamp-aligned search
  • Multi-modal embeddings
  • Integration with Data Worker
  • Citation with timestamps
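Timestamp-aligned search ultimately reduces to mapping a text match back to its segment's time range. A toy sketch over segments shaped like the STT response (start, end, text; data invented):

```shell
# Find which transcript segment mentions "roadmap" and cite its time range.
printf '%s\t%s\t%s\n' 0.0 2.5 'Hello, welcome to the meeting'  > segments.tsv
printf '%s\t%s\t%s\n' 2.5 6.0 'Today we discuss the roadmap'  >> segments.tsv
printf '%s\t%s\t%s\n' 6.0 9.5 'First item: the media worker'  >> segments.tsv
awk -F'\t' '/roadmap/ {printf "[%ss-%ss] %s\n", $1, $2, $3}' segments.tsv
# -> [2.5s-6.0s] Today we discuss the roadmap
```

In the real pipeline the match comes from a vector search in the Data Worker, but the citation step is the same lookup.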

STT Backends

  • Whisper.cpp (Local): GPU-accelerated transcription with GGML models
  • OpenAI Whisper (Cloud): high-accuracy transcription via API
  • Faster-Whisper (Local): CTranslate2-optimized inference
  • Azure Speech (Cloud): Microsoft Azure Cognitive Services
# Transcribe audio with Whisper.cpp
curl -X POST http://media-worker:8894/api/v1/stt/transcribe \
  -H 'Content-Type: application/json' \
  -d '{
    "audio_url": "https://example.com/meeting.mp3",
    "language": "en",
    "model": "large-v3",
    "options": {
      "word_timestamps": true,
      "diarize": true
    }
  }'

# Response with word-level timestamps
{
  "text": "Hello, welcome to the meeting...",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting",
      "speaker": "SPEAKER_00"
    }
  ],
  "language": "en",
  "duration": 3600.5
}
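With speaker-labelled segments like the response above, per-speaker talk time is a one-line aggregation. A sketch over invented (start, end, speaker) triples:

```shell
# Sum (end - start) per speaker from "start end speaker" triples.
printf '%s\n' \
  '0.0 2.5 SPEAKER_00' \
  '2.5 6.0 SPEAKER_01' \
  '6.0 7.0 SPEAKER_00' > diarized.txt
awk '{talk[$3] += $2 - $1} END {for (s in talk) printf "%s %.1fs\n", s, talk[s]}' diarized.txt | sort
# -> SPEAKER_00 3.5s
# -> SPEAKER_01 3.5s
```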

TTS Backends

  • Piper (Local): fast local neural TTS with 100+ voices
  • ElevenLabs (Cloud): premium HD voices with emotion control
  • OpenAI TTS (Cloud): natural-sounding voices via API
  • Coqui TTS (Local): open-source multilingual TTS
# Synthesize speech with Piper
curl -X POST http://media-worker:8894/api/v1/tts/synthesize \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Welcome to Eldric, your AI-powered assistant.",
    "voice_id": "en_US-amy-medium",
    "output_format": "mp3",
    "speed": 1.0
  }' --output speech.mp3

# List available voices
curl http://media-worker:8894/api/v1/tts/voices

# Stream audio in real-time (SSE)
curl -N http://media-worker:8894/api/v1/tts/stream \
  -H 'Content-Type: application/json' \
  -d '{"text": "This is streamed audio...", "voice_id": "en_GB-alan-medium"}'

REST API

Method Endpoint Description
GET /health Health check with backend status
GET /dashboard Web dashboard for monitoring
GET /api/v1/media/info Worker capabilities and models
POST /api/v1/stt/transcribe Transcribe audio file or URL
POST /api/v1/stt/stream Streaming transcription (SSE)
GET /api/v1/stt/models List available STT models
POST /api/v1/tts/synthesize Generate speech from text
POST /api/v1/tts/stream Stream audio output (SSE)
GET /api/v1/tts/voices List available TTS voices
POST /api/v1/video/transcribe Extract and transcribe video audio
POST /api/v1/video/extract-frames Extract frames at intervals
POST /api/v1/voice/chat Voice-in, voice-out conversation
POST /api/v1/rag/ingest-audio Index audio for RAG search
POST /api/v1/rag/ingest-video Index video for RAG search

Use Cases

1. Meeting Transcription & Summarization

Automatically transcribe meetings, generate summaries, and make content searchable.

1 Upload recording
2 STT transcription
3 Speaker diarization
4 LLM summary
5 RAG index
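The five steps above chain three HTTP calls. A sketch written to a script rather than executed (it needs a live cluster); endpoints come from the REST API table, while hostnames and payloads are illustrative:

```shell
# Meeting pipeline: transcribe -> summarize -> index (not run here).
cat > meeting-pipeline.sh <<'EOF'
#!/bin/sh
# Steps 1-3: transcription with speaker diarization (Pro+ feature)
curl -X POST http://media-worker:8894/api/v1/stt/transcribe \
  -d '{"audio_url": "/recordings/standup.mp3", "options": {"diarize": true}}' > transcript.json
# Step 4: LLM summary via the AI Worker
curl -X POST http://inference:8890/v1/chat/completions \
  -d "{\"model\": \"llama3.2:3b\", \"messages\": [{\"role\": \"user\", \"content\": \"Summarize: $(cat transcript.json)\"}]}"
# Step 5: index the audio for later semantic search
curl -X POST http://media-worker:8894/api/v1/rag/ingest-audio \
  -d '{"audio_url": "/recordings/standup.mp3"}'
EOF
grep -c '^curl' meeting-pipeline.sh    # -> 3
```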

2. Voice-Enabled AI Assistant

Build hands-free AI interactions with natural voice input and output.

1 Voice command
2 Whisper STT
3 Agent processing
4 Piper TTS
// GUI integration - enable voice mode
Settings > Voice > Enable Voice Chat
  STT Backend:  Whisper.cpp (local)
  TTS Backend:  Piper
  Voice:        en_US-amy-medium
  Push-to-talk: Space bar

3. Video Content Library

Index your video library for AI-powered search and Q&A with timestamp citations.

1 Ingest videos
2 Extract audio
3 Transcribe
4 Vector embed
5 Semantic search

4. Podcast Production Pipeline

Automate podcast post-production with transcription, show notes, and audio enhancement.

# Full podcast processing pipeline
curl -X POST http://media-worker:8894/api/v1/video/transcribe \
  -d '{"video_url": "/recordings/episode-42.mp4"}'

# Generate show notes with LLM
curl -X POST http://inference:8890/v1/chat/completions \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Generate show notes: ...transcript..."}]
  }'

# Create audio clips for social media
curl -X POST http://media-worker:8894/api/v1/tts/synthesize \
  -d '{"text": "Key quote from the episode...", "voice_id": "en_US-amy-medium"}'

Quick Start

1. Start Media Worker

# Run with local Whisper and Piper
./eldric-mediad --port 8894 \
  --stt-backend whisper_cpp \
  --whisper-model /models/ggml-large-v3.bin \
  --whisper-gpu \
  --tts-backend piper \
  --piper-models /models/piper

# Register with controller
./eldric-mediad --controller http://controller:8880 \
  --data-workers http://data-worker:8892

2. Test Transcription

# Health check
curl http://localhost:8894/health

# Transcribe a file
curl -X POST http://localhost:8894/api/v1/stt/transcribe \
  -H 'Content-Type: application/json' \
  -d '{"audio_url": "/path/to/audio.mp3"}'

# List voices
curl http://localhost:8894/api/v1/tts/voices

3. Voice Chat Example

# Record audio and send to voice chat endpoint
AUDIO_B64=$(base64 -i question.wav)

curl -X POST http://localhost:8894/api/v1/voice/chat \
  -H 'Content-Type: application/json' \
  -d "{
    \"audio_data\": \"$AUDIO_B64\",
    \"model\": \"llama3.2:3b\",
    \"voice_id\": \"en_US-amy-medium\"
  }" | jq -r '.audio_data' | base64 -d > response.mp3

# Play the response (afplay is macOS; use mpv or ffplay on Linux)
afplay response.mp3

Licensing

Feature Free Standard Professional Enterprise
STT (basic) Yes Yes Yes Yes
STT (streaming) - Yes Yes Yes
STT (diarization) - - Yes Yes
TTS (basic) Yes Yes Yes Yes
TTS (streaming) - Yes Yes Yes
TTS (voice cloning) - - - Yes
Video transcription Yes Yes Yes Yes
Multimedia RAG - Yes Yes Yes
Max audio duration 5 min 30 min 2 hours Unlimited
Max video duration 2 min 15 min 1 hour Unlimited
Concurrent jobs 2 5 20 Unlimited
Media workers 1 2 5 Unlimited

Port Reference

Component Port Protocol Description
Media Worker 8894 HTTP/REST Audio/video processing, STT/TTS
Data Worker 8892 HTTP/REST RAG storage for multimedia indexing
AI Worker 8890 HTTP/REST LLM for voice chat responses
Controller 8880 HTTP/REST Cluster management & licensing