Python's dominance in 2025 is undeniable. With the explosion of AI applications, ML model serving, and intelligent API wrappers, Python has cemented itself as the language of choice for building the infrastructure that powers modern AI systems. If you're building an AI wrapper around OpenAI, Anthropic, or serving your own fine-tuned models, chances are you're reaching for FastAPI.
FastAPI has become the de facto standard for Python API development—and for good reason. It's fast, has excellent async support, generates OpenAPI documentation automatically, and validates requests with type hints out of the box. But there's a significant gap between running uvicorn main:app --reload on your laptop and deploying a production-grade API that can handle thousands of inference requests without falling over.
This guide bridges that gap. We'll walk through deploying FastAPI to production Kubernetes using Convox, covering everything from optimized Docker builds to autoscaling configuration. By the end, you'll have a production-ready deployment that can scale with your traffic and handle the unpredictable load patterns that AI applications often experience.
Before we dive into configuration, let's address a common misconception. Many developers deploy their FastAPI apps using the same command they use for local development:
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
This works, but it's fundamentally unsuitable for production. Uvicorn is an ASGI server—it's excellent at handling async requests, but running a single Uvicorn process means you're limited to a single event loop. When that event loop blocks (and it will, especially during CPU-intensive ML inference), your entire API becomes unresponsive.
The solution is a process manager. Gunicorn, traditionally a WSGI server, can manage multiple Uvicorn workers, each running its own event loop. When one worker is busy with a long-running inference call, other workers can handle incoming requests. This process-manager setup is the standard pattern for running Python APIs in production containers.
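Multiple workers give you parallelism across processes, and within each worker you can also keep the event loop responsive by pushing blocking calls onto a thread. Here's a minimal sketch of that complementary technique, where predict_sync is a hypothetical stand-in for your own CPU-bound inference code:

from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

# Hypothetical CPU-bound inference call; substitute your own function.
def predict_sync(prompt: str) -> str:
    ...

@app.post("/predict")
async def predict(prompt: str):
    # Run the blocking call in a thread so this worker's event loop stays
    # free to accept other requests while inference is in progress.
    result = await run_in_threadpool(predict_sync, prompt)
    return {"result": result}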
The formula for worker count is well-established: (2 × CPU cores) + 1. For a container allocated 2 CPU cores, you'd run 5 workers. This gives you genuine concurrency while leaving headroom for the system.
Container image size matters more than most developers realize. Smaller images mean faster deployments, quicker scaling, and reduced attack surface. A naive Python Dockerfile can easily balloon to 1GB+. With multi-stage builds, we can get that down to a few hundred megabytes while maintaining all the functionality we need.
Here's a production-ready Dockerfile for a FastAPI application:
# Stage 1: Build dependencies
FROM python:3.12-slim as builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# Stage 2: Production image
FROM python:3.12-slim as production
WORKDIR /app
# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Create non-root user for security
RUN useradd --create-home --shell /bin/bash appuser
USER appuser
# Copy application code
COPY --chown=appuser:appuser . .
# Expose the application port
EXPOSE 8000
# Run with Gunicorn + Uvicorn workers
CMD ["gunicorn", "main:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "-b", "0.0.0.0:8000"]
Let's break down the key decisions here:
Multi-stage builds: The first stage installs build tools and compiles dependencies. The second stage only copies the compiled virtual environment, leaving behind all the build tooling. This typically reduces image size by 40-60%.
Virtual environment isolation: By creating a venv in the builder stage and copying it wholesale, we get clean dependency isolation and predictable paths in the final image.
Non-root user: Running as root inside containers is a security anti-pattern. Creating a dedicated user with minimal permissions reduces the blast radius of any potential vulnerability.
The CMD directive: We're running Gunicorn with 4 Uvicorn workers. This gives us genuine parallelism for handling concurrent requests. Adjust the worker count based on your container's CPU allocation.
For more complex configurations, it's worth creating a dedicated Gunicorn configuration file. This gives you fine-grained control over timeouts, logging, and worker behavior—all critical when deploying AI APIs that might have variable response times.
Create a gunicorn.conf.py file:
import multiprocessing
import os
# Worker configuration
workers = int(os.getenv("GUNICORN_WORKERS", multiprocessing.cpu_count() * 2 + 1))
worker_class = "uvicorn.workers.UvicornWorker"
worker_tmp_dir = "/dev/shm"
# Binding
bind = f"0.0.0.0:{os.getenv('PORT', '8000')}"
# Timeouts - critical for ML inference endpoints
timeout = 120 # Allow up to 2 minutes for slow inference
graceful_timeout = 30
keepalive = 5
# Logging
accesslog = "-"
errorlog = "-"
loglevel = os.getenv("LOG_LEVEL", "info")
# Security
limit_request_line = 4094
limit_request_fields = 100
limit_request_field_size = 8190
# Lifecycle hooks
def on_starting(server):
    print("Gunicorn server starting...")

def on_exit(server):
    print("Gunicorn server shutting down...")
Update your Dockerfile's CMD to use this configuration:
CMD ["gunicorn", "main:app", "-c", "gunicorn.conf.py"]
The timeout configuration deserves special attention. AI inference endpoints can have highly variable latency. A call to GPT-4 might return in 2 seconds or 45 seconds depending on the prompt complexity and OpenAI's current load. The 120-second timeout gives you headroom for these slow requests without leaving connections hanging indefinitely.
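It also pays to enforce a shorter timeout inside the application itself, so a stalled upstream call returns a clean error instead of tying up a worker until Gunicorn kills it. A rough sketch, where call_model is a placeholder for whichever provider client you use:

import asyncio

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Placeholder for your OpenAI/Anthropic/etc. client call.
async def call_model(prompt: str) -> str:
    ...

@app.post("/generate")
async def generate(prompt: str):
    try:
        # Keep the application-level timeout below Gunicorn's 120s so slow
        # upstream calls fail fast rather than exhausting the worker timeout.
        text = await asyncio.wait_for(call_model(prompt), timeout=90)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=504, detail="Model call timed out")
    return {"text": text}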
The worker_tmp_dir = "/dev/shm" setting is a performance optimization. Gunicorn uses temporary files for worker heartbeats, and /dev/shm is a RAM-backed filesystem that's significantly faster than disk.
Health checks are non-negotiable for production deployments. They enable zero-downtime deployments by ensuring Convox only routes traffic to healthy containers. For AI applications, a good health check should verify not just that your server is running, but that it can actually handle requests.
Here's a robust health check implementation in FastAPI:
from fastapi import FastAPI, Response
from datetime import datetime, timezone
import os

app = FastAPI(title="AI API", version="1.0.0")

@app.get("/health")
async def health_check():
    """
    Health check endpoint for Convox load balancer.
    Returns 200 if the service is healthy and ready to accept requests.
    """
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": os.getenv("RELEASE", "unknown")
    }

@app.get("/ready")
async def readiness_check():
    """
    Readiness check - verifies the application can handle traffic.
    Add any dependency checks here (database, cache, etc.)
    """
    # Example: Check if ML model is loaded
    # if not model_loaded:
    #     return Response(status_code=503)
    return {"status": "ready"}

@app.post("/inference")
async def run_inference(prompt: str):
    """
    Your actual inference endpoint.
    """
    # Your ML inference logic here
    pass
The health check returns the release ID from the environment—this is automatically injected by Convox as documented in the environment variables reference. This makes debugging deployment issues much easier since you can immediately verify which version is running.
For more complex applications with database dependencies, you might want to add connectivity checks to your readiness endpoint. Just be careful not to make health checks too heavy—they run frequently and shouldn't add significant load to your infrastructure.
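One way to keep them light is to cache the result of the dependency probe so the load balancer can poll frequently without hitting your database on every check. A sketch under that assumption, where check_dependencies is a placeholder for whatever your app actually depends on:

import time

from fastapi import FastAPI, Response

app = FastAPI()

# Placeholder probe; replace the body with a real check, e.g. a
# "SELECT 1" against your database pool or a cache PING.
async def check_dependencies() -> bool:
    return True

_last_check = {"ok": True, "at": 0.0}

@app.get("/ready")
async def readiness_check():
    now = time.monotonic()
    # Re-probe at most every 30 seconds so frequent checks stay cheap.
    if now - _last_check["at"] > 30:
        _last_check["ok"] = await check_dependencies()
        _last_check["at"] = now
    if not _last_check["ok"]:
        return Response(status_code=503)
    return {"status": "ready"}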
The convox.yml manifest ties everything together. Here's a complete configuration for a production FastAPI deployment with autoscaling:
environment:
  - PORT=8000
  - LOG_LEVEL=info
  - OPENAI_API_KEY
  - ANTHROPIC_API_KEY
services:
  api:
    build: .
    port: 8000
    health:
      path: /health
      interval: 10
      timeout: 5
      grace: 30
    scale:
      count: 2-10
      cpu: 512
      memory: 1024
      targets:
        cpu: 70
    deployment:
      minimum: 50
      maximum: 200
    termination:
      grace: 30
    timeout: 180
Let's examine each section in detail.
Environment variables: Variables listed without values (like OPENAI_API_KEY) must be set using convox env set before deployment. Variables with defaults (like PORT=8000) are available to all processes. This pattern keeps secrets out of your codebase while providing sensible defaults for non-sensitive configuration. See the environment documentation for more details.
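On the application side, this configuration is read straight from the environment. A small sketch (the variable names match the manifest above; the fail-fast check on required keys is optional, but it makes a misconfigured release crash at startup rather than at request time):

import os

# Defaulted, non-sensitive settings from the manifest.
PORT = int(os.getenv("PORT", "8000"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "info")

# Secrets provided via convox env set; raise immediately if they are missing.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]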
Health check configuration: The grace period of 30 seconds gives your application time to start up before health checks begin—important for Python applications that might need to load ML models or establish database connections at startup. The health checks documentation covers all available options.
Autoscaling: The scale section configures horizontal pod autoscaling. With count: 2-10, Convox maintains at least 2 instances for availability and scales up to 10 under load. The targets.cpu: 70 setting triggers scale-up when average CPU utilization exceeds 70%. This is typically the sweet spot for autoscaling Python apps—aggressive enough to handle traffic spikes, conservative enough to avoid thrashing.
Resource allocation: We're allocating 512 millicores (half a CPU) and 1024MB of memory per container. For ML inference workloads, you might need to increase these values significantly. Start conservative and monitor actual usage.
Deployment configuration: The deployment section controls rolling updates. With minimum: 50, at least half your containers remain running during deployments, ensuring continuous availability.
Timeout: The 180-second timeout accommodates slow inference requests. This should match or exceed your Gunicorn timeout configuration.
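The grace period mentioned above pairs naturally with FastAPI's lifespan hook: do slow startup work such as model loading before the app begins serving, and report unhealthy if it never completed. A sketch, with load_model standing in for your own startup logic:

from contextlib import asynccontextmanager

from fastapi import FastAPI, Response

# Hypothetical loader; replace with your real model or connection setup.
def load_model():
    return object()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs before the server accepts traffic; the 30-second grace period
    # in convox.yml covers this window.
    app.state.model = load_model()
    yield
    # Runs on shutdown, within the termination grace period.
    app.state.model = None

app = FastAPI(title="AI API", version="1.0.0", lifespan=lifespan)

@app.get("/health")
async def health_check():
    # Report unhealthy if startup never finished loading the model.
    if getattr(app.state, "model", None) is None:
        return Response(status_code=503)
    return {"status": "healthy"}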
With your Dockerfile and convox.yml in place, deployment is straightforward. First, set your environment variables:
$ convox env set OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-ant-... -a myapi
Setting OPENAI_API_KEY, ANTHROPIC_API_KEY... OK
Release: RABCDEFGHI
Then deploy:
$ convox deploy -a myapi
Packaging source... OK
Uploading source... OK
Starting build... OK
Building: .
Step 1/12 : FROM python:3.12-slim as builder
...
Build: BABCDEFGHI
Release: RBCDEFGHIJ
Promoting RBCDEFGHIJ...
OK
Once deployed, verify your service is running:
$ convox services -a myapi
SERVICE DOMAIN PORTS
api api.myapi.0a1b2c3d4e5f.convox.cloud 443:8000
You can monitor scaling behavior in real-time:
$ convox ps -a myapi
ID SERVICE STATUS RELEASE STARTED
api-7d8f9g0h1i-abc12 api running RBCDEFGHIJ 2 minutes ago
api-7d8f9g0h1i-def34 api running RBCDEFGHIJ 2 minutes ago
As traffic increases and CPU utilization rises above 70%, you'll see additional processes spawn automatically. When traffic subsides, Convox scales back down to the minimum count.
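If you want to see this in action before real traffic arrives, a burst of concurrent requests is usually enough to push CPU past the target. A rough load-generation sketch using httpx (assumed to be installed; point the URL at your own service domain and at an endpoint that exercises real work):

import asyncio

import httpx

URL = "https://api.myapi.0a1b2c3d4e5f.convox.cloud/health"  # substitute your own

async def main(total: int = 2000, concurrency: int = 50):
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient(timeout=30) as client:

        async def hit() -> int:
            # Limit in-flight requests to the configured concurrency.
            async with sem:
                resp = await client.get(URL)
                return resp.status_code

        codes = await asyncio.gather(*[hit() for _ in range(total)])
    print({code: codes.count(code) for code in set(codes)})

if __name__ == "__main__":
    asyncio.run(main())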
AI applications typically require API keys for external services—OpenAI, Anthropic, Pinecone, and others. Never commit these to your repository. Convox's environment management provides secure secret storage.
Set secrets individually or in batch:
$ convox env set \
OPENAI_API_KEY=sk-... \
ANTHROPIC_API_KEY=sk-ant-... \
PINECONE_API_KEY=... \
DATABASE_URL=postgres://... \
-a myapi
Each env set creates a new release. To apply changes without redeploying your code, promote the release:
$ convox releases promote -a myapi
This triggers a rolling restart with the new environment variables, maintaining availability throughout the update.
Once deployed, you'll want visibility into your application's behavior. Convox provides built-in logging:
$ convox logs -a myapi --follow
2025-01-15T10:30:45Z service/api/api-7d8f9g0h1i-abc12 INFO: Started server process [1]
2025-01-15T10:30:45Z service/api/api-7d8f9g0h1i-abc12 INFO: Waiting for application startup.
2025-01-15T10:30:46Z service/api/api-7d8f9g0h1i-abc12 INFO: Application startup complete.
For debugging specific issues, you can exec into a running container:
$ convox exec api-7d8f9g0h1i-abc12 bash -a myapi
appuser@api-7d8f9g0h1i-abc12:/app$
For production monitoring at scale, consider integrating with Datadog or similar observability platforms.
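Even before wiring up a full observability platform, a few lines of middleware will surface per-request latency in those same logs. A minimal sketch:

import logging
import time

from fastapi import FastAPI, Request

logger = logging.getLogger("api.access")
app = FastAPI()

@app.middleware("http")
async def log_request_timing(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # These lines show up in convox logs alongside Gunicorn's own output.
    logger.info("%s %s -> %d in %.1fms", request.method, request.url.path,
                response.status_code, elapsed_ms)
    return response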
Deploying FastAPI to Kubernetes doesn't have to be complicated. With the right Dockerfile, proper process management, and a well-configured convox.yml, you can have a production-ready deployment that scales automatically with demand.
The key takeaways:
- Run Gunicorn with Uvicorn workers instead of a bare Uvicorn process, and size the worker count to your CPU allocation.
- Use multi-stage Docker builds and a non-root user to keep images small and secure.
- Expose lightweight health and readiness endpoints, and give the app a startup grace period.
- Let convox.yml handle autoscaling, rolling deployments, and request timeouts that line up with your Gunicorn configuration.
This architecture handles the unpredictable traffic patterns common to AI applications—sudden spikes when your product goes viral, quiet periods overnight, and everything in between.
Ready to deploy your FastAPI application? Convox offers a Getting Started Guide that walks through installation and your first deployment in detail.
Check out our Python example applications for additional reference implementations. Console accounts are free, and you can Get Started Free to deploy your first Rack in minutes.
For teams with specific compliance requirements or complex deployment needs, reach out to our team to discuss how Convox can help scale your AI infrastructure.