Both Eldric Client and Eldric Multi-API support multiple inference backends, so you can mix local inference with cloud APIs across your infrastructure.
OpenAI-Compatible Streaming
Universal SSE Streaming
All backends support real-time token streaming via Server-Sent Events (SSE). Set stream: true in your /v1/chat/completions request and tokens are proxied back through Edge → Router → Worker → Backend as they are generated.
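For example, a streaming request can look like the minimal sketch below, using the OpenAI Python SDK pointed at the gateway. The base URL, API key, and model name are illustrative placeholders, not values defined by this project.

```python
# Minimal streaming sketch; base URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical Eldric Multi-API address
    api_key="sk-local-placeholder",       # use whatever auth your deployment expects
)

# stream=True asks the server for Server-Sent Events; the SDK yields one
# chunk per event as tokens arrive from the selected backend.
stream = client.chat.completions.create(
    model="llama3",  # any model exposed by a configured backend
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```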
Unified Backend Features
| Feature | Capabilities |
|---|---|
| Streaming | SSE (Server-Sent Events), real-time token delivery, OpenAI-compatible format, zero-copy proxy |
| Unified API | /v1/chat/completions, /v1/models, /v1/embeddings; the same API for all backends (see the example below) |
| Load Balancing | Round-robin / least connections, AI-powered routing, automatic failover, health monitoring |
| Multi-Backend | Mix local + cloud, fallback chains, per-model routing, hot backend switching |
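Because every backend sits behind the same OpenAI-style endpoints, model discovery and embeddings use the same request shape regardless of what serves the model. A rough sketch with raw HTTP; the gateway address, API key, and embedding model name are assumptions, not fixed values.

```python
# Sketch of the unified endpoints; base URL, key, and model name are placeholders.
import requests

BASE = "http://localhost:8080/v1"  # hypothetical gateway address
HEADERS = {"Authorization": "Bearer sk-local-placeholder"}

# /v1/models lists every model advertised by the configured backends.
models = requests.get(f"{BASE}/models", headers=HEADERS, timeout=30).json()
print([m["id"] for m in models["data"]])

# /v1/embeddings uses the same request shape no matter which backend serves it.
resp = requests.post(
    f"{BASE}/embeddings",
    headers=HEADERS,
    json={"model": "nomic-embed-text", "input": "hello world"},
    timeout=30,
)
vector = resp.json()["data"][0]["embedding"]
print(len(vector), "dimensions")
```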
Local & Self-Hosted
| Backend | Port | API | Features |
|---|---|---|---|
| Ollama | 11434 | REST API | Auto model discovery, default backend |
| vLLM | 8000 | OpenAI-compatible | PagedAttention, high throughput |
| llama.cpp | 8080 | REST + WebSocket | GGUF models, CPU + GPU |
| HuggingFace TGI | 8080 | REST + gRPC | Tensor parallelism, continuous batching |
| LocalAI | 8080 | OpenAI-compatible | Multiple formats, CPU optimized |
| ExLlamaV2 | 5000 | REST API | GPTQ/EXL2 quants, fast inference |
| LMDeploy | 23333 | OpenAI-compatible | TurboMind engine, quantization |
| MLC LLM | 8080 | REST API | Universal deploy, WebGPU support |
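Several of the backends above (for example Ollama and vLLM) also expose OpenAI-compatible endpoints on the ports listed, so the same client code can target any of them directly by swapping the base URL. The ports in this sketch follow the table; the model name and prompt are placeholders.

```python
# Same client, different local backends: only the base URL changes.
from openai import OpenAI

BACKENDS = {
    "ollama": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    "vllm": "http://localhost:8000/v1",     # vLLM's OpenAI-compatible server
}

def ask(backend: str, model: str, prompt: str) -> str:
    client = OpenAI(base_url=BACKENDS[backend], api_key="not-needed-locally")
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(ask("ollama", "llama3", "Summarize SSE in one line."))
```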
Enterprise & ML Platforms
| Backend | Port | API | Features |
|---|---|---|---|
| NVIDIA Triton | 8000-8002 | REST + gRPC | TensorRT optimization, multi-framework |
| NVIDIA NIM | 8000 | OpenAI-compatible | Optimized containers, enterprise ready |
| TensorFlow Serving | 8501/8500 | REST + gRPC | Model versioning, batch prediction |
| TorchServe | 8080/8081 | REST + gRPC | PyTorch native, model archive |
| ONNX Runtime | 8001 | REST + gRPC | Cross-platform, hardware agnostic |
| DeepSpeed-MII | 28080 | REST API | ZeRO-Inference, low latency |
| BentoML | 3000 | REST + gRPC | Model packaging, adaptive batching |
| Ray Serve | 8000 | REST API | Auto-scaling, distributed |
Cloud AI Services
| Backend | Endpoint | API | Features |
|---|---|---|---|
| AWS SageMaker | HTTPS endpoint | REST API | Auto-scaling, multi-model |
| AWS Bedrock | HTTPS endpoint | REST API | Foundation models, managed service |
| Azure ML | HTTPS endpoint | REST + SDK | Managed compute, MLflow integration |
| Azure OpenAI | HTTPS endpoint | OpenAI-compatible | Enterprise security, regional deploy |
| Google Vertex AI | HTTPS endpoint | REST + gRPC | TPU support, Model Garden |
| Groq | HTTPS API | OpenAI-compatible | LPU inference, ultra-fast |
| Together AI | HTTPS API | OpenAI-compatible | Open models, fine-tuning |
| Fireworks AI | HTTPS API | OpenAI-compatible | Fast inference, function calling |
| Anyscale | HTTPS API | OpenAI-compatible | Ray-based, scalable |
| Replicate | HTTPS API | REST API | Model hosting, pay-per-use |
Model Provider APIs
| Provider | Endpoint | API | Features |
|---|---|---|---|
| OpenAI | HTTPS API | REST API | GPT-4, GPT-4o, Assistants API |
| Anthropic | HTTPS API | REST API | Claude models, tool use |
| Google Gemini | HTTPS API | REST API | Gemini Pro/Ultra, multimodal |
| Mistral AI | HTTPS API | OpenAI-compatible | Mistral/Mixtral, function calling |
| Cohere | HTTPS API | REST API | Command models, embeddings + rerank |
| AI21 Labs | HTTPS API | REST API | Jurassic models, specialized tasks |
Specialized & Platform-Specific
| Backend | Port | API | Features |
|---|---|---|---|
| MLX (Apple Silicon) | 8080 | REST API | Metal acceleration, unified memory |
| KServe | 8080 | REST + gRPC | Kubernetes native, serverless |
| Seldon Core | 9000 | REST + gRPC | ML deployment, A/B testing |
| OpenAI-Compatible | Any port | Custom endpoints | API key auth, drop-in support |
Backend by Use Case
| Use Case | Recommended Backends | Why |
|---|---|---|
| Development | Ollama, LocalAI, LMDeploy | Easy setup, free, local |
| Production API | vLLM, TGI, Triton, NIM | High throughput, batching, enterprise |
| Edge / IoT | llama.cpp, MLC LLM, ExLlamaV2 | CPU inference, small footprint, quantized |
| Apple Silicon | MLX, Ollama, MLC LLM | Metal acceleration, unified memory |
| Low Latency | Groq, Fireworks, DeepSpeed-MII | Optimized hardware, fast inference |
| Enterprise Cloud | Azure OpenAI, Bedrock, Vertex AI | Compliance, SLA, managed |
| Open Models | Together AI, Anyscale, Replicate | Llama, Mistral, open weights |
| Kubernetes | KServe, Seldon, Ray Serve | Cloud-native, auto-scaling |
Streaming & Feature Support
| Backend | Type | Streaming | Vision | Tools | Embeddings |
|---|---|---|---|---|---|
| Ollama | Local | ✓ | ✓ | ✓ | ✓ |
| vLLM | Enterprise | ✓ | ✓ | ✓ | ✓ |
| TGI | Enterprise | ✓ | ✓ | — | — |
| NVIDIA Triton | Enterprise | ✓ | ✓ | — | ✓ |
| llama.cpp | Local | ✓ | ✓ | — | ✓ |
| MLX | Local (macOS) | ✓ | — | — | — |
| OpenAI | Cloud | ✓ | ✓ | ✓ | ✓ |
| Anthropic | Cloud | ✓ | ✓ | ✓ | — |
| Groq | Cloud | ✓ | — | ✓ | — |
| Together AI | Cloud | ✓ | ✓ | ✓ | ✓ |
| Azure OpenAI | Cloud | ✓ | ✓ | ✓ | ✓ |
✓ = Supported, — = Not available for this backend
Availability
- Eldric Client (CLI + GUI): Ollama, vLLM, llama.cpp, TGI, MLX, OpenAI-compatible endpoints
- Eldric Multi-API: all 32+ backends, with unified API, load balancing, streaming, and failover