Both Eldric Client and Eldric Multi-API support multiple inference backends, so you can mix local inference with cloud APIs across your infrastructure.
OpenAI-Compatible Streaming
Universal SSE Streaming
All backends support real-time token streaming via Server-Sent Events (SSE). Set stream: true in your /v1/chat/completions request and tokens are proxied back through Edge → Router → Worker → Backend as they are generated.
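For example, a streaming request can look like the minimal sketch below, using the OpenAI Python SDK pointed at the gateway. The base URL, API key, and model name are illustrative placeholders, not values defined by this project.

```python
# Minimal streaming sketch; base URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical Eldric Multi-API address
    api_key="sk-local-placeholder",       # use whatever auth your deployment expects
)

# stream=True asks the server for Server-Sent Events; the SDK yields one
# chunk per event as tokens arrive from the selected backend.
stream = client.chat.completions.create(
    model="llama3",  # any model exposed by a configured backend
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```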
Unified Backend Features
| Feature | Capabilities |
|---|---|
| Streaming | SSE (Server-Sent Events), real-time token delivery, OpenAI-compatible format, zero-copy proxy |
| Unified API | /v1/chat/completions, /v1/models, /v1/embeddings; the same API for all backends (see the example below) |
| Load Balancing | Round-robin / least connections, AI-powered routing, automatic failover, health monitoring |
| Multi-Backend | Mix local + cloud, fallback chains, per-model routing, hot backend switching |
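Because every backend sits behind the same OpenAI-style endpoints, model discovery and embeddings use the same request shape regardless of what serves the model. A rough sketch with raw HTTP; the gateway address, API key, and embedding model name are assumptions, not fixed values.

```python
# Sketch of the unified endpoints; base URL, key, and model name are placeholders.
import requests

BASE = "http://localhost:8080/v1"  # hypothetical gateway address
HEADERS = {"Authorization": "Bearer sk-local-placeholder"}

# /v1/models lists every model advertised by the configured backends.
models = requests.get(f"{BASE}/models", headers=HEADERS, timeout=30).json()
print([m["id"] for m in models["data"]])

# /v1/embeddings uses the same request shape no matter which backend serves it.
resp = requests.post(
    f"{BASE}/embeddings",
    headers=HEADERS,
    json={"model": "nomic-embed-text", "input": "hello world"},
    timeout=30,
)
vector = resp.json()["data"][0]["embedding"]
print(len(vector), "dimensions")
```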
Local & Self-Hosted
| Backend | Port | API | Features |
|---|---|---|---|
| Ollama | 11434 | REST API | Auto model discovery, default backend |
| vLLM | 8000 | OpenAI-compatible | PagedAttention, high throughput |
| llama.cpp | 8080 | REST + WebSocket | GGUF models, CPU + GPU |
| HuggingFace TGI | 8080 | REST + gRPC | Tensor parallelism, continuous batching |
| LocalAI | 8080 | OpenAI-compatible | Multiple formats, CPU optimized |
| ExLlamaV2 | 5000 | REST API | GPTQ/EXL2 quants, fast inference |
| LMDeploy | 23333 | OpenAI-compatible | TurboMind engine, quantization |
| MLC LLM | 8080 | REST API | Universal deploy, WebGPU support |
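Several of the backends above (for example Ollama and vLLM) also expose OpenAI-compatible endpoints on the ports listed, so the same client code can target any of them directly by swapping the base URL. The ports in this sketch follow the table; the model name and prompt are placeholders.

```python
# Same client, different local backends: only the base URL changes.
from openai import OpenAI

BACKENDS = {
    "ollama": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    "vllm": "http://localhost:8000/v1",     # vLLM's OpenAI-compatible server
}

def ask(backend: str, model: str, prompt: str) -> str:
    client = OpenAI(base_url=BACKENDS[backend], api_key="not-needed-locally")
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(ask("ollama", "llama3", "Summarize SSE in one line."))
```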
Enterprise & ML Platforms
| Backend | Port | API | Features |
|---|---|---|---|
| NVIDIA Triton | 8000-8002 | REST + gRPC | TensorRT optimization, multi-framework |
| NVIDIA NIM | 8000 | OpenAI-compatible | Optimized containers, enterprise ready |
| TensorFlow Serving | 8501/8500 | REST + gRPC | Model versioning, batch prediction |
| TorchServe | 8080/8081 | REST + gRPC | PyTorch native, model archive |
| ONNX Runtime | 8001 | REST + gRPC | Cross-platform, hardware agnostic |
| DeepSpeed-MII | 28080 | REST API | ZeRO-Inference, low latency |
| BentoML | 3000 | REST + gRPC | Model packaging, adaptive batching |
| Ray Serve | 8000 | REST API | Auto-scaling, distributed |
Cloud AI Services
| Backend | Endpoint | API | Features |
|---|---|---|---|
| AWS SageMaker | HTTPS endpoint | REST API | Auto-scaling, multi-model |
| AWS Bedrock | HTTPS endpoint | REST API | Foundation models, managed service |
| Azure ML | HTTPS endpoint | REST + SDK | Managed compute, MLflow integration |
| Azure OpenAI | HTTPS endpoint | OpenAI-compatible | Enterprise security, regional deploy |
| Google Vertex AI | HTTPS endpoint | REST + gRPC | TPU support, Model Garden |
| Groq | HTTPS API | OpenAI-compatible | LPU inference, ultra-fast |
| Together AI | HTTPS API | OpenAI-compatible | Open models, fine-tuning |
| Fireworks AI | HTTPS API | OpenAI-compatible | Fast inference, function calling |
| Anyscale | HTTPS API | OpenAI-compatible | Ray-based, scalable |
| Replicate | HTTPS API | REST API | Model hosting, pay-per-use |
Model Provider APIs
| Provider | Endpoint | API | Features |
|---|---|---|---|
| OpenAI | HTTPS API | REST API | GPT-4, GPT-4o, Assistants API |
| Anthropic | HTTPS API | REST API | Claude models, tool use |
| Google Gemini | HTTPS API | REST API | Gemini Pro/Ultra, multimodal |
| Mistral AI | HTTPS API | OpenAI-compatible | Mistral/Mixtral, function calling |
| Cohere | HTTPS API | REST API | Command models, embeddings + rerank |
| AI21 Labs | HTTPS API | REST API | Jurassic models, specialized tasks |
Specialized & Platform-Specific
| Backend | Port | API | Features |
|---|---|---|---|
| MLX (Apple Silicon) | 8080 | REST API | Metal acceleration, unified memory |
| KServe | 8080 | REST + gRPC | Kubernetes native, serverless |
| Seldon Core | 9000 | REST + gRPC | ML deployment, A/B testing |
| OpenAI-Compatible | Any port | Custom endpoints | API key auth, drop-in support |
Backend by Use Case
| Use Case | Recommended Backends | Why |
|---|---|---|
| Development | Ollama, LocalAI, LMDeploy | Easy setup, free, local |
| Production API | vLLM, TGI, Triton, NIM | High throughput, batching, enterprise |
| Edge / IoT | llama.cpp, MLC LLM, ExLlamaV2 | CPU inference, small footprint, quantized |
| Apple Silicon | MLX, Ollama, MLC LLM | Metal acceleration, unified memory |
| Low Latency | Groq, Fireworks, DeepSpeed-MII | Optimized hardware, fast inference |
| Enterprise Cloud | Azure OpenAI, Bedrock, Vertex AI | Compliance, SLA, managed |
| Open Models | Together AI, Anyscale, Replicate | Llama, Mistral, open weights |
| Kubernetes | KServe, Seldon, Ray Serve | Cloud-native, auto-scaling |
Streaming & Feature Support
| Backend | Type | Streaming | Vision | Tools | Embeddings |
|---|---|---|---|---|---|
| Ollama | Local | ✓ | ✓ | ✓ | ✓ |
| vLLM | Enterprise | ✓ | ✓ | ✓ | ✓ |
| TGI | Enterprise | ✓ | ✓ | — | — |
| NVIDIA Triton | Enterprise | ✓ | ✓ | — | ✓ |
| llama.cpp | Local | ✓ | ✓ | — | ✓ |
| MLX | Local (macOS) | ✓ | — | — | — |
| OpenAI | Cloud | ✓ | ✓ | ✓ | ✓ |
| Anthropic | Cloud | ✓ | ✓ | ✓ | — |
| Groq | Cloud | ✓ | — | ✓ | — |
| Together AI | Cloud | ✓ | ✓ | ✓ | ✓ |
| Azure OpenAI | Cloud | ✓ | ✓ | ✓ | ✓ |
✓ = Supported, — = Not available for this backend
Availability
- Eldric Client (CLI + GUI): Ollama, vLLM, llama.cpp, TGI, MLX, OpenAI-compatible endpoints
- Eldric Multi-API: all 32+ backends, with unified API, load balancing, streaming, and failover