Audio/Video Processing with Speech-to-Text, Text-to-Speech, and Multimedia RAG - Transform voice into text, text into voice, and index multimedia content for AI-powered search
v4.0.0
Architecture Overview
Media Worker Processing Pipeline
Voice Chat Pipeline
Complete voice-in, voice-out conversation flow with your AI models.
1. Audio Input (microphone or audio file) → 2. Transcribe (Whisper STT) → 3. Process (LLM inference) → 4. Synthesize (Piper/ElevenLabs) → 5. Audio Out (speaker output)
# Voice chat with a single API call
curl -X POST http://media-worker:8894/api/v1/voice/chat \
-H 'Content-Type: application/json' \
-d '{
"audio_data": "'$(base64 -i recording.wav)'",
"model": "llama3.2:3b",
"voice_id": "en_US-amy-medium",
"response_format": "mp3"
}'
# Response includes both text and audio
{
"text_response": "I can help you with that...",
"audio_data": "<base64-encoded-mp3>",
"transcription": "What's the weather like?",
"duration_ms": 2340
}
Core Capabilities
Speech-to-Text (Free+)
Convert spoken audio into accurate text transcriptions.
Whisper.cpp local GPU acceleration
OpenAI Whisper API fallback
99+ languages supported
Real-time streaming transcription
Speaker diarization (Pro+)
Text-to-Speech (Free+)
Generate natural-sounding speech from text.
Piper local TTS (100+ voices)
ElevenLabs HD voices
OpenAI TTS integration
SSML support for prosody control
Voice cloning (Enterprise)
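Since SSML prosody control is listed above, here is a minimal payload sketch. Whether the synthesize endpoint takes SSML through the regular "text" field or a dedicated field is an assumption, as is the exact set of supported tags:

```shell
# SSML sketch for prosody control: emphasis plus a 300 ms pause.
# Whether this goes in the "text" field of /api/v1/tts/synthesize or a
# dedicated "ssml" field is an assumption about the API.
SSML='<speak>Welcome to <emphasis level="strong">Eldric</emphasis>.<break time="300ms"/> How can I help?</speak>'
printf '%s\n' "$SSML"
```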
Video Processing (Standard+)
Extract and process video content for AI workflows.
FFmpeg transcoding
Frame extraction at intervals
Scene detection
Audio track extraction
Video transcription pipeline
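The frame-extraction endpoint from the REST API table can drive workflows like thumbnail generation or vision-model input. The interval and format fields below are assumed names, not documented parameters:

```shell
# Request sketch for /api/v1/video/extract-frames. "interval_seconds"
# and "image_format" are assumed field names; adjust to the real schema.
PAYLOAD='{"video_url": "/recordings/demo.mp4", "interval_seconds": 5, "image_format": "jpeg"}'
printf '%s\n' "$PAYLOAD"
# curl -X POST http://media-worker:8894/api/v1/video/extract-frames \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```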
Multimedia RAG (Standard+)
Index and search audio/video content semantically.
Automatic transcription indexing
Timestamp-aligned search
Multi-modal embeddings
Integration with Data Worker
Citation with timestamps
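The ingestion endpoints listed in the REST API table take the media to index; the "collection" and "metadata" fields below are assumptions about the payload shape, not documented parameters:

```shell
# Request sketch for /api/v1/rag/ingest-audio. "collection" and
# "metadata" are assumed field names, not documented parameters.
PAYLOAD='{"audio_url": "/recordings/standup.mp3", "collection": "meetings", "metadata": {"team": "platform"}}'
printf '%s\n' "$PAYLOAD"
# curl -X POST http://media-worker:8894/api/v1/rag/ingest-audio \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```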
STT Backends
Whisper.cpp (Local): GPU-accelerated transcription with GGML models
OpenAI Whisper (Cloud): High-accuracy transcription via the OpenAI API
Faster-Whisper (Local): CTranslate2-optimized inference
Azure Speech (Cloud): Microsoft Azure Cognitive Services
# Transcribe audio with Whisper.cpp
curl -X POST http://media-worker:8894/api/v1/stt/transcribe \
-H 'Content-Type: application/json' \
-d '{
"audio_url": "https://example.com/meeting.mp3",
"language": "en",
"model": "large-v3",
"options": {
"word_timestamps": true,
"diarize": true
}
}'
# Response with word-level timestamps
{
"text": "Hello, welcome to the meeting...",
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Hello, welcome to the meeting",
"speaker": "SPEAKER_00"
}
],
"language": "en",
"duration": 3600.5
}
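Segment timestamps like those above map directly onto subtitle formats. A minimal sketch of the seconds-to-SRT conversion, applied to the first segment of the sample response:

```shell
# Convert a segment start/end time in seconds to an SRT timestamp
# (HH:MM:SS,mmm). Pure awk, so it works directly on the JSON values above.
srt_ts() {
  awk -v t="$1" 'BEGIN { h = int(t/3600); m = int((t%3600)/60); s = t - h*3600 - m*60;
    printf "%02d:%02d:%06.3f", h, m, s }' | tr '.' ','
}
# First segment from the sample response: start 0.0, end 2.5
echo "1"
echo "$(srt_ts 0.0) --> $(srt_ts 2.5)"
echo "Hello, welcome to the meeting"
```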
TTS Backends
Piper (Local): Fast neural TTS with 100+ voices
ElevenLabs (Cloud): Premium HD voices with emotion control
OpenAI TTS (Cloud): Natural-sounding voices via API
Coqui TTS (Local): Open-source multilingual TTS
# Synthesize speech with Piper
curl -X POST http://media-worker:8894/api/v1/tts/synthesize \
-H 'Content-Type: application/json' \
-d '{
"text": "Welcome to Eldric, your AI-powered assistant.",
"voice_id": "en_US-amy-medium",
"output_format": "mp3",
"speed": 1.0
}' --output speech.mp3
# List available voices
curl http://media-worker:8894/api/v1/tts/voices
# Stream audio in real-time (SSE)
curl -N http://media-worker:8894/api/v1/tts/stream \
-H 'Content-Type: application/json' \
-d '{"text": "This is streamed audio...", "voice_id": "en_GB-alan-medium"}'
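One way to consume the SSE stream, assuming each `data:` line carries a single base64-encoded audio chunk (the actual event format is an assumption):

```shell
# Decode an SSE audio stream to raw bytes, assuming each "data:" line
# holds one base64 chunk. A real call pipes `curl -N ...` into decode_sse.
decode_sse() {
  sed -n 's/^data: //p' | while read -r chunk; do
    printf '%s' "$chunk" | base64 -d
  done
}
# Simulated two-chunk stream standing in for the curl -N output:
printf 'data: %s\n\ndata: %s\n\n' "$(printf 'AB' | base64)" "$(printf 'CD' | base64)" | decode_sse
echo
```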
REST API
Method  Endpoint                        Description
GET     /health                         Health check with backend status
GET     /dashboard                      Web dashboard for monitoring
GET     /api/v1/media/info              Worker capabilities and models
POST    /api/v1/stt/transcribe          Transcribe audio file or URL
POST    /api/v1/stt/stream              Streaming transcription (SSE)
GET     /api/v1/stt/models              List available STT models
POST    /api/v1/tts/synthesize          Generate speech from text
POST    /api/v1/tts/stream              Stream audio output (SSE)
GET     /api/v1/tts/voices              List available TTS voices
POST    /api/v1/video/transcribe        Extract and transcribe video audio
POST    /api/v1/video/extract-frames    Extract frames at intervals
POST    /api/v1/voice/chat              Voice-in, voice-out conversation
POST    /api/v1/rag/ingest-audio        Index audio for RAG search
POST    /api/v1/rag/ingest-video        Index video for RAG search
Use Cases
1. Meeting Transcription & Summarization
Automatically transcribe meetings, generate summaries, and make content searchable.
1. Upload recording → 2. STT transcription → 3. Speaker diarization → 4. LLM summary → 5. RAG index
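The LLM-summary step above amounts to wrapping the transcript in a chat-completion request. A sketch using jq to keep the quoting safe (the prompt wording is illustrative):

```shell
# Build the chat-completions body for the summary step; jq handles
# escaping of quotes and newlines inside the transcript text.
summary_request() {
  jq -n --arg t "$1" \
    '{model: "llama3.2:3b",
      messages: [{role: "user", content: ("Summarize this meeting transcript:\n\n" + $t)}]}'
}
summary_request "Hello, welcome to the meeting"
# Then: curl -X POST http://inference:8890/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$(summary_request "$TRANSCRIPT")"
```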
2. Voice-Enabled AI Assistant
Build hands-free AI interactions with natural voice input and output.
3. Video Knowledge Base
Index your video library for AI-powered search and Q&A with timestamp citations.
1. Ingest videos → 2. Extract audio → 3. Transcribe → 4. Vector embed → 5. Semantic search
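Ingesting a whole library is just the ingest call in a loop. A sketch that builds one request per file (the "collection" field is an assumed parameter):

```shell
# One ingest payload per video. In a real run, replace the sample list
# with e.g. /videos/*.mp4 and POST each payload as shown in the comment.
for f in intro.mp4 keynote.mp4 q-and-a.mp4; do
  payload=$(printf '{"video_url": "/videos/%s", "collection": "library"}' "$f")
  echo "$payload"
  # curl -X POST http://media-worker:8894/api/v1/rag/ingest-video \
  #   -H 'Content-Type: application/json' -d "$payload"
done
```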
4. Podcast Production Pipeline
Automate podcast post-production with transcription, show notes, and audio enhancement.
# Full podcast processing pipeline
curl -X POST http://media-worker:8894/api/v1/video/transcribe \
-d '{"video_url": "/recordings/episode-42.mp4"}'
# Generate show notes with LLM
curl -X POST http://inference:8890/v1/chat/completions \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Generate show notes: ...transcript..."}]
}'
# Create audio clips for social media
curl -X POST http://media-worker:8894/api/v1/tts/synthesize \
-d '{"text": "Key quote from the episode...", "voice_id": "en_US-amy-medium"}'
Quick Start
1. Start Media Worker
# Run with local Whisper and Piper
./eldric-mediad --port 8894 \
--stt-backend whisper_cpp \
--whisper-model /models/ggml-large-v3.bin \
--whisper-gpu \
--tts-backend piper \
--piper-models /models/piper
# Register with controller
./eldric-mediad --controller http://controller:8880 \
--data-workers http://data-worker:8892
2. Test Transcription
# Health check
curl http://localhost:8894/health
# Transcribe a file
curl -X POST http://localhost:8894/api/v1/stt/transcribe \
-H 'Content-Type: application/json' \
-d '{"audio_url": "/path/to/audio.mp3"}'
# List voices
curl http://localhost:8894/api/v1/tts/voices
3. Voice Chat Example
# Record audio and send to voice chat endpoint
AUDIO_B64=$(base64 -i question.wav)
curl -X POST http://localhost:8894/api/v1/voice/chat \
-H 'Content-Type: application/json' \
-d "{
\"audio_data\": \"$AUDIO_B64\",
\"model\": \"llama3.2:3b\",
\"voice_id\": \"en_US-amy-medium\"
}" | jq -r '.audio_data' | base64 -d > response.mp3
# Play the response (afplay is macOS; use mpg123 or ffplay on Linux)
afplay response.mp3