Pick from 30+ pre-configured templates or bring any HuggingFace model. The Deploy Wizard handles GPU selection, health checks, autoscaling, and environment configuration. Your first inference endpoint can be live in under 15 minutes.
KEDA-based autoscaling scales GPU services to zero replicas when traffic stops and spins them back up on the next request. Set per-app budget caps with automatic shutdown. GPU instances cost $1 to $32+ per hour - stop paying when nothing is running.
22 templates expose OpenAI-compatible API endpoints out of the box. Swap your OpenAI base URL for your Convox endpoint and your existing application code works unchanged. Migrate off managed APIs without rewriting anything.
Real-time dashboards for GPU utilization, VRAM, power draw, and throughput per pod. Built-in DCGM exporter with 20+ metrics. Configurable chart windows from 5 minutes to 24 hours. Grafana deep links for advanced investigation.
Everything runs in your AWS account. No data leaves your VPC. No third-party API calls. No per-request markup. Per-action RBAC with admin gates on budget and infrastructure mutations. Full audit trail with actor attribution on every event.
Not in the catalog? Enter any HuggingFace model ID and the Deploy Wizard auto-configures GPU selection, memory allocation, and serving engine. Gated models supported with your HuggingFace access token. Six serving engines: vLLM, SGLang, TGI, ComfyUI, Triton, and NIM.