KEDA-based autoscaling monitors inference traffic and scales GPU services to zero replicas when idle. When the next request arrives, your service spins back up automatically. You pay for GPU time only when your models are actively processing requests.
Set monthly USD caps per application. Choose from three enforcement modes: alert-only, block new deploys, or auto-shutdown. The system tracks cumulative spend against your cap and takes action before you get a surprise bill.
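As a rough illustration of how the three enforcement modes differ, here is a minimal Python sketch. It is not Convox's implementation: the names `Enforcement` and `budget_action` are hypothetical, and it assumes action is taken once cumulative spend reaches the cap, a threshold the description above does not specify.

```python
from enum import Enum


class Enforcement(Enum):
    # Hypothetical identifiers for the three modes named in the text.
    ALERT_ONLY = "alert-only"
    BLOCK_DEPLOYS = "block-new-deploys"
    AUTO_SHUTDOWN = "auto-shutdown"


def budget_action(spend_usd: float, cap_usd: float, mode: Enforcement) -> str:
    """Return the action a cap check would take for a given cumulative spend.

    Assumption: nothing happens below the cap; at or above it, the
    configured mode decides the action.
    """
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    if spend_usd < cap_usd:
        return "none"
    if mode is Enforcement.ALERT_ONLY:
        return "alert"
    if mode is Enforcement.BLOCK_DEPLOYS:
        return "block_deploys"
    return "shutdown"
```

For example, `budget_action(120.0, 100.0, Enforcement.BLOCK_DEPLOYS)` returns `"block_deploys"`, while the same spend under `Enforcement.ALERT_ONLY` returns `"alert"`.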
Break down GPU spend by service, instance type, and capacity type (on-demand vs. spot). View month-to-date costs across all apps or drill into individual service cost histories. Export cost data to CSV for accounting and chargeback.
GPU instances run in your AWS account at standard AWS pricing. No per-request fees. No per-token charges. No inference API markup. The only additional cost is your Convox plan. Compare that to $0.001+ per request on managed inference APIs.
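To make the comparison concrete, the Python sketch below computes the request rate at which a flat hourly GPU instance price matches per-request pricing. Only the $0.001 per-request figure comes from the text above. The function name is hypothetical, the hourly rate in the usage note is a placeholder rather than a real AWS price, and the Convox plan cost is left out.

```python
def break_even_requests_per_hour(instance_usd_per_hour: float,
                                 api_usd_per_request: float = 0.001) -> float:
    """Requests per hour at which a flat hourly instance price equals
    what a per-request API would charge for the same traffic."""
    if instance_usd_per_hour < 0:
        raise ValueError("instance_usd_per_hour must not be negative")
    if api_usd_per_request <= 0:
        raise ValueError("api_usd_per_request must be positive")
    return instance_usd_per_hour / api_usd_per_request
```

With a placeholder instance price of 1.00 USD per hour, the break-even point is about 1,000 requests per hour. Above that rate the flat instance price is the cheaper option, and below it the per-request API is.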
See exactly how hard your GPUs are working. Real-time charts for utilization, VRAM usage, power draw, and throughput per pod. Identify underutilized instances and right-size your GPU selection. Configurable windows from 5 minutes to 24 hours.
Configure min and max replicas, scale-up thresholds, and cooldown periods per service. KEDA scales on GPU utilization, request queue depth, or custom metrics. Burst to multiple GPUs during peak traffic, then drop back to zero overnight.
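The Python sketch below shows, in simplified form, how a replica count can be derived from queue depth and clamped to the configured min and max. It uses the average-value style of calculation (total metric divided by a per-replica target, rounded up) and deliberately ignores cooldown periods and stabilization. The function and parameter names are hypothetical and are not Convox or KEDA settings.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each one handles at most target_per_replica
    queued requests, clamped to [min_replicas, max_replicas]."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if min_replicas < 0 or max_replicas < min_replicas:
        raise ValueError("need 0 <= min_replicas <= max_replicas")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

With `min_replicas=0`, an empty queue yields zero replicas, which matches the scale-to-zero behavior described above. A queue of 25 requests with a target of 10 per replica yields 3 replicas, and a very deep queue is capped at `max_replicas`.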