The Complete Convox v2 to v3 Migration Guide: Pitfalls, Breaking Changes, and Lessons from the Field

The Paradigm Shift: ECS to Kubernetes

If you've been running Convox v2, you know the magic: a single convox deploy turns your code into a running, load-balanced, auto-scaling application. That magic isn't going away. But underneath, everything has changed.

Convox v3 replaces AWS Elastic Container Service (ECS) with Kubernetes — specifically EKS on AWS, with GKE, AKS, and DOKS also supported. This isn't a minor version bump. It's an architectural overhaul that unlocks multi-cloud portability, standard Kubernetes ecosystem compatibility, advanced deployment controls, and granular workload placement. Your convox.yml still works. Your CLI commands still work. But the behaviors you've internalized over years of v2 usage — how logs appear, how resources are provisioned, how costs accumulate — have fundamentally shifted.

This guide is written for the v2 user who loves Convox's simplicity and doesn't want Kubernetes to steal it. We'll be transparent about the hard parts, show you exactly how to handle them, and make sure your migration goes smoothly.

Deployment Debugging: The #1 Migration Friction Point

Let's start with the issue that generates the most confusion during migration: your deployment hangs, you run convox logs, and you see absolutely nothing.

In v2, logs from your containers streamed immediately, regardless of health check status. In v3, convox logs only shows output from containers that have passed their Kubernetes readiness probe. If your app crashes on startup, fails its health check, or takes too long to boot, the container is running — but from Convox's perspective, it doesn't exist yet. This is by design in Kubernetes, but it's disorienting if you're used to v2's behavior.

The kubectl Escape Hatch

To debug a stuck deployment in v3, you need to go one layer deeper. First, export your Rack's Kubernetes configuration:

$ convox rack kubeconfig -r myorg/production > ~/.kube/config

Now find the pods for your app. Convox namespaces apps as rackName-appName:

$ kubectl get pods -n production-myapp
NAME                   READY   STATUS             RESTARTS   AGE
web-7f8b4d6c9-x2k4j   0/1     CrashLoopBackOff   4          3m

There's your problem — CrashLoopBackOff. Now pull the logs directly from the container, including previous crashed instances:

$ kubectl logs -n production-myapp web-7f8b4d6c9-x2k4j --previous

For even more detail — like why Kubernetes couldn't schedule the pod at all — use describe:

$ kubectl describe pod -n production-myapp web-7f8b4d6c9-x2k4j

If a deployment is hanging and you need to abort immediately, use:

$ convox apps cancel -a myapp

This triggers an immediate rollback to your last known good release. Bookmark these commands — they'll save you hours during migration.

For a comprehensive reference on how every Convox concept maps to Kubernetes resources, including naming patterns, labels, and kubectl commands for every resource type, check out our Convox to Kubernetes Resource Mapping guide.

The NAT Gateway Cost Surprise (And How to Fix It)

In v2, your EC2 instances ran in public subnets. They had direct internet access, and outbound traffic was essentially free beyond standard data transfer charges. In v3, worker nodes run in private subnets by default. This is a security best practice — your compute is no longer directly addressable from the internet — but it introduces a cost that catches many users off guard: NAT Gateways.

Every outbound internet request from your private-subnet nodes (pulling Docker images from ECR, calling external APIs, sending webhooks) must pass through a NAT Gateway. On AWS, each NAT Gateway costs approximately $32/month plus $0.045 per GB processed. In a High Availability setup with 3 Availability Zones, that's a base cost of roughly $100/month before any data transfer.
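A quick back-of-the-envelope estimate using the list prices quoted above (these are illustrative US-region prices and vary by region; the function and its parameters are our own sketch, not part of any Convox tooling):

```python
def monthly_nat_cost(gateways: int, gb_processed: float,
                     per_gateway: float = 32.0, per_gb: float = 0.045) -> float:
    """Rough monthly NAT Gateway cost: a flat fee per gateway
    plus a per-GB charge on all traffic processed."""
    return gateways * per_gateway + gb_processed * per_gb

# HA Rack with 3 AZs, before any data transfer:
print(monthly_nat_cost(3, 0))    # 96.0 -- the "roughly $100/month" base cost
# Same Rack pushing 500 GB/month through NAT:
print(monthly_nat_cost(3, 500))  # 118.5
```

The base cost alone makes it worth checking whether your heaviest traffic (image pulls, database queries) can bypass NAT entirely, as described below.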

The real cost danger comes from traffic that leaves your VPC when it doesn't need to. If your app connects to an external managed database like MongoDB Atlas, or calls any third-party service using public endpoints, all of that traffic flows through NAT. For services that live inside your VPC or a peered VPC, the fix is making sure you're using private hostnames or IP addresses so traffic stays on the AWS backbone and never touches NAT at all. AWS-managed services like RDS generally handle this well when accessed via their private endpoints, but self-managed or external databases are where costs can quietly spiral.

For internal service-to-service communication within your cluster, Kubernetes DNS is the way to go. In Convox, any service can reach another service in the same app using the pattern:

convoxServiceName.rackName-appName.svc.cluster.local

This routes traffic entirely within the cluster, bypassing NAT, load balancers, and even the AWS network layer. If your services are calling each other over public URLs or external DNS names, switching to internal DNS can meaningfully reduce your NAT bill.
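As a sanity check, the naming pattern above can be expressed as a tiny helper (the service, Rack, and app names here are illustrative):

```python
def internal_dns(service: str, rack: str, app: str) -> str:
    """Build the in-cluster DNS name for a Convox service, following the
    convoxServiceName.rackName-appName.svc.cluster.local pattern."""
    return f"{service}.{rack}-{app}.svc.cluster.local"

# A worker calling the 'web' service of app 'myapp' on Rack 'production':
print(internal_dns("web", "production", "myapp"))
# web.production-myapp.svc.cluster.local
```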

How to Mitigate NAT Costs

There are several strategies to keep NAT costs under control:

  • VPC Endpoints: Set up VPC Gateway Endpoints for S3 and Interface Endpoints for ECR. These route traffic directly within the AWS network, bypassing NAT entirely for your heaviest traffic sources (Docker image pulls and artifact storage).
  • Private DNS and internal routing: Always use private hostnames or IP addresses for resources within your VPC or peered VPCs. For service-to-service calls within your cluster, use Kubernetes DNS (serviceName.rackName-appName.svc.cluster.local) to keep traffic off NAT entirely.
  • Review external database connections: If you're using externally hosted databases like MongoDB Atlas, Fauna, PlanetScale, or similar services, those connections route through NAT on every query. Consider VPC peering or private endpoints where the provider supports them.
  • Use private=false for non-production: For development and staging Racks where compliance isn't a concern, you can set the private rack parameter to false during installation. This places worker nodes in public subnets and eliminates NAT costs entirely.

⚠️ The private parameter is immutable after installation. You cannot change it without reinstalling the Rack. Plan accordingly.

Rack Parameters to Be Aware of on Day One

Several v3 rack parameters are either immutable or have outsized impact on cost, stability, and developer experience. Here are the ones worth evaluating before installing your first v3 Rack.

Build Node: Stop Building on Production Nodes

In v2, Convox always provisioned a dedicated build instance. In v3, builds run on your main cluster nodes by default. This means a large Docker build can consume CPU and memory that your production services need, leading to resource contention, OOM kills, and degraded performance.

The fix is straightforward. When installing or updating your Rack, enable a dedicated build node:

$ convox rack params set build_node_enabled=true build_node_min_count=0

Setting build_node_min_count=0 means the build node scales to zero after 30 minutes of inactivity, so you only pay for it during active builds. We strongly recommend enabling this for any Rack running production workloads.

CIDR Planning: The Immutable Parameter

The cidr rack parameter (default 10.0.0.0/16) defines the IP address range for your VPC. Like private, it is immutable after installation. If you plan to use VPC peering to connect your Rack to other VPCs — for example, to reach a shared database or an internal router — you must ensure your CIDR blocks don't overlap. Plan your network topology before you install.
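Before installing, you can verify that a planned Rack CIDR doesn't collide with your existing VPCs using Python's standard ipaddress module (the CIDR ranges below are examples, not recommendations):

```python
import ipaddress

rack_cidr = ipaddress.ip_network("10.0.0.0/16")  # the default Convox Rack CIDR

# CIDRs of VPCs you may want to peer with later:
existing_vpcs = ["10.0.0.0/16", "10.1.0.0/16", "172.31.0.0/16"]

for cidr in existing_vpcs:
    status = "OVERLAPS" if rack_cidr.overlaps(ipaddress.ip_network(cidr)) else "ok"
    print(f"{cidr}: {status}")
# 10.0.0.0/16: OVERLAPS
# 10.1.0.0/16: ok
# 172.31.0.0/16: ok
```

If any peering candidate overlaps, pick a different cidr value at install time; you can't change it later.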

Pod Disruption Budgets: Unblocking Scale-Down

v3 introduces Kubernetes Pod Disruption Budgets (PDBs) via the pdb_default_min_available_percentage rack parameter, which defaults to 50. This means at least 50% of your pods must remain available during voluntary disruptions like node drains or cluster autoscaling.

Here's the gotcha: if you have a service running with count: 1, 50% of 1 rounds up to 1. The PDB tells Kubernetes that 1 pod must always be available, which means the pod can never be voluntarily evicted. This blocks the cluster autoscaler from draining and removing empty or underutilized nodes, silently inflating your compute costs.

For development Racks or services that can tolerate brief downtime, consider lowering this value or scaling single-replica services to count: 2 in production.
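The rounding behavior that causes this is easy to see in a few lines (percentage-based PDB minAvailable values round up to a whole pod):

```python
import math

def evictable_pods(count: int, min_available_pct: int = 50) -> int:
    """Pods that can be voluntarily evicted under a percentage-based PDB.
    Kubernetes rounds minAvailable up to the next whole pod."""
    min_available = math.ceil(count * min_available_pct / 100)
    return count - min_available

print(evictable_pods(1))  # 0 -> node drains and scale-down are blocked
print(evictable_pods(2))  # 1 -> the autoscaler can drain a node
```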

convox.yml Syntax: The Critical Differences

Your convox.yml will need updates. Most are straightforward, but a few are significant enough to break your app if missed. Here's a complete comparison of the changes that matter most.

Resources: Don't Accidentally Containerize Your Production Database

This is the migration change most likely to catch you off guard. In v2, defining type: postgres in your resources section provisioned an AWS RDS instance — a managed, durable, backed-up database. In v3, type: postgres provisions a containerized database running inside your cluster. This is great for development but not what you want in production.

To get an actual RDS instance in v3, you must explicitly use type: rds-postgres:

v2 syntax (provisions RDS):

resources:
  database:
    type: postgres
    options:
      storage: 100

v3 syntax (provisions RDS):

resources:
  database:
    type: rds-postgres
    options:
      storage: 100

The same applies to MySQL (rds-mysql), MariaDB (rds-mariadb), and Redis (elasticache-redis). Review your resource definitions carefully before deploying to v3.

Importing an Existing RDS Database

If you already have an RDS instance running outside of Convox — whether it was created manually, by Terraform, or by your v2 Rack — you don't have to start from scratch. Convox v3 supports importing existing RDS databases (and ElastiCache instances) directly into your app's resource definitions.

To import, use the import option in your resource definition and provide the masterUserPassword as an environment variable reference:

resources:
  mydb:
    type: rds-postgres
    options:
      import: my-existing-rds-instance-identifier
      masterUserPassword: ${MYDBPASS}
services:
  web:
    resources:
      - mydb

Before deploying, set the password via convox env set:

$ convox env set MYDBPASS=my_secure_password -a myapp
Setting MYDBPASS... OK
Release: RABCDEFGHI

While the import option is set, Convox treats the database as a passive linked resource. It won't modify the database's configuration, and no other options in the resource definition will be applied. Convox will inject the connection URL and credential environment variables into your linked services just like any other resource.

When you're ready for Convox to take over full management of the imported database, simply remove the import option (and the masterUserPassword reference) from your convox.yml and redeploy. At that point, Convox will begin managing the database's lifecycle and any configured options will take effect.

One important safety note: if an application is deleted, any RDS databases it created are deleted with it. Imported databases are exempt — they are left intact, and you must delete them manually if you no longer need them. That said, we strongly recommend enabling deletionProtection on any production database regardless of how it was provisioned.

Resource Overlays: A Simpler Alternative

If you'd rather skip the import process and just point your app at an existing database without Convox managing it at all, you can use a Resource Overlay. Set the resource's URL directly as an environment variable, and Convox will use that instead of provisioning anything:

$ convox env set MAIN_URL=postgres://user:pass@your-rds-host:5432/dbname -a myapp
Setting MAIN_URL... OK
Release: RABCDEFGHI

When a matching environment variable is set (e.g., MAIN_URL for a resource named main), Convox won't start a containerized resource for it. This is a great approach for teams that manage their databases through other tools and just need Convox to connect to them.

Resource Allocation: How Scheduling Works in v3

In v2, scale.memory set a hard memory limit. If your process exceeded it, ECS killed it. Simple. In v3, the model is different because Kubernetes separates scheduling from enforcement.

The values you set in scale.cpu and scale.memory become Kubernetes resource requests. These are the guaranteed minimums that the scheduler uses to decide which node your pod runs on. A pod requesting 512 MB of memory will only be placed on a node with at least 512 MB available. This is the primary mechanism for ensuring your services have the resources they need, and getting your requests right is the single most important thing to focus on.

Convox also supports limit.cpu and limit.memory, which set hard caps. If a pod exceeds its memory limit, Kubernetes kills it (OOMKill). If it exceeds its CPU limit, it gets throttled. However, in most cases we'd recommend starting without limits and only adding them if you have a specific reason to. Overly tight limits are one of the most common causes of deployment issues on Kubernetes — pods get OOM-killed or CPU-throttled during normal traffic spikes, leading to restarts, failed health checks, and cascading problems that can be hard to trace back to a resource constraint.

Here's a typical configuration with just requests:

services:
  web:
    build: .
    port: 3000
    scale:
      count: 2       # Static replica count
      cpu: 256       # Request: guaranteed 256m CPU
      memory: 512    # Request: guaranteed 512 MB RAM

Static Count vs Autoscaling

The scale.count value controls how many replicas of your service are running. Set it as a fixed number for a static deployment, or as a range to enable Kubernetes Horizontal Pod Autoscaling (HPA):

services:
  web:
    build: .
    port: 3000
    scale:
      count: 2-10     # Autoscale between 2 and 10 replicas
      cpu: 256
      memory: 512
      targets:
        cpu: 70        # Scale up when avg CPU exceeds 70%

When you specify a range, Convox creates a HorizontalPodAutoscaler that monitors your pods' actual resource usage against the targets you define. The targets section tells Kubernetes when to scale: a cpu: 70 target means new replicas are added when average CPU utilization across existing pods exceeds 70% of the requested CPU.

Accurate resource requests are critical here. If your requests are too high relative to actual usage, utilization will always appear low and HPA will never scale up. If they're too low, HPA will trigger scaling constantly. A good starting point is to set requests close to your service's steady-state usage and let HPA handle the spikes.
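Kubernetes computes the desired replica count as ceil(currentReplicas × currentUtilization / target), which makes the effect of mis-sized requests concrete (the utilization figures below are illustrative):

```python
import math

def hpa_desired(current_replicas: int, current_util_pct: float,
                target_pct: float) -> int:
    """Desired replicas per the Kubernetes HPA scaling formula."""
    return math.ceil(current_replicas * current_util_pct / target_pct)

# Requests sized to steady-state: a spike to 105% of the 70% target scales 2 -> 3
print(hpa_desired(2, 105, 70))  # 3
# Requests set far too high: real load registers as 20% utilization,
# so HPA never scales up no matter how busy the service feels
print(hpa_desired(2, 20, 70))   # 1
```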

Rolling Deployments

v3 uses Kubernetes rolling deployments by default, which means new pods are brought up before old pods are terminated. The deployment block controls how aggressively this rollout happens:

services:
  web:
    build: .
    port: 3000
    scale:
      count: 4
      cpu: 256
      memory: 512
    deployment:
      minimum: 50    # At least 50% of pods stay available during deploy
      maximum: 200   # Up to 200% of desired count can exist during rollout

With these settings and a count of 4, Kubernetes will keep at least 2 pods running at all times during a deploy and can spin up as many as 8 total while transitioning. This gives you zero-downtime deployments, but keep in mind that during the rollout you temporarily need enough cluster capacity to run both old and new pods.
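The pod counts during a rollout follow directly from those two percentages; a simple illustration of the semantics described above (this is our own sketch of the arithmetic, not Convox code):

```python
def rollout_bounds(count: int, minimum_pct: int, maximum_pct: int) -> tuple[int, int]:
    """(minimum pods kept available, maximum total pods) during a rolling deploy."""
    return count * minimum_pct // 100, count * maximum_pct // 100

low, high = rollout_bounds(4, 50, 200)
print(f"at least {low} pods stay up; up to {high} pods may run at once")
# at least 2 pods stay up; up to 8 pods may run at once
```

The `high` value is what determines the extra cluster capacity you need headroom for during deploys.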

If you're coming from v2 where deployments were more opaque, the key mental shift is that your scale requests directly affect scheduling, autoscaling behavior, and rollout capacity. Getting them right is more impactful than setting limits.

Timers, Probes, and Other Syntax Gotchas

Beyond the big changes, there are several smaller syntax differences that can trip you up during migration.

Timer Cron Syntax

v2 used AWS CloudWatch's 6-field cron expression format (including a year field and the ? character). v3 uses standard Kubernetes 5-field cron. Your timers will fail validation if you don't update the syntax.

Version | Syntax | Example (3 AM daily)
v2 | 6-field (min hour dom month dow year) | 0 3 * * ? *
v3 | 5-field (min hour dom month dow) | 0 3 * * *
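For simple expressions, the conversion is mechanical: drop the trailing year field and replace CloudWatch's ? wildcard with *. A rough helper covering that common case (it does not handle CloudWatch-specific tokens like L, W, or rate expressions):

```python
def cloudwatch_to_k8s_cron(expr: str) -> str:
    """Convert a 6-field CloudWatch cron expression to 5-field Kubernetes cron.
    Simple case only: drops the year field and maps '?' to '*'."""
    fields = expr.split()
    if len(fields) == 6:
        fields = fields[:5]  # drop the trailing year field
    return " ".join("*" if f == "?" else f for f in fields)

print(cloudwatch_to_k8s_cron("0 3 * * ? *"))  # 0 3 * * *
```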

Termination Grace Period

The drain key is deprecated in v3. Replace it with the termination.grace syntax:

v2:

services:
  web:
    drain: 60

v3:

services:
  web:
    termination:
      grace: 60

Startup Probes: A Game-Changer for Slow-Booting Apps

v3 introduces startupProbe via the health check configuration. If you've ever had a Rails app that takes 30+ seconds to boot (asset compilation, seed data loading, ML model initialization), you know the frustration of Kubernetes killing it before it finishes starting because the liveness probe timed out.

The startup probe gates the liveness and readiness probes entirely. Until the startup probe passes, Kubernetes won't even begin checking the other probes. This gives slow-booting apps the breathing room they need without weakening your ongoing health checks:

services:
  web:
    build: .
    port: 3000
    health:
      path: /health
      interval: 10
      timeout: 5
    startup:
      path: /health
      timeout: 120   # Give the app up to 2 minutes to boot

If you're migrating a monolithic application, startup probes alone can eliminate 90% of your deployment-related headaches.

port vs ports: Public Ingress vs Internal Routing

This is one of the more confusing syntax changes in v3. In v2, you had a single port: key and a ports: array that handled everything. In v3, they serve entirely different purposes:

  • port: 3000 (Singular) — Exposes the service via public HTTPS ingress on port 443, routed to your container on port 3000. This is what most web services need.
  • ports: (Plural) — Exposes ports for internal cluster communication or for use with custom Balancers. These ports are not publicly accessible by default.

v2 syntax:

services:
  web:
    build: .
    port: 3000

v3 syntax (identical for basic web services):

services:
  web:
    build: .
    port: 3000

For TCP or UDP services (like a game server, MQTT broker, or gRPC endpoint), v3 introduces the balancers: block, which provisions a dedicated Network Load Balancer. Note that the service must still define a port (singular) that is different from the ports listed in ports:. Convox uses this port for standard ingress, health checks, and internal service management:

balancers:
  gameserver:
    service: game
    ports:
      7777: 7777
services:
  game:
    build: .
    port: 3000    # Required: used by Convox for health checks and ingress
    ports:
      - 7777      # Exposed via the NLB balancer above

If you omit the port value or set it to the same value as one of your ports: entries, Convox won't be able to properly manage health checks or route traffic to the service.

Workload Placement: Right Workloads on the Right Infrastructure

One of the most powerful features you gain in v3 is fine-grained control over where your workloads run. In v2, all your services shared the same pool of EC2 instances. In v3, you can create custom node groups with different instance types, capacity modes, and scaling rules, and then direct specific services to specific node groups.

This is useful for a lot of real-world scenarios: running CPU-heavy batch workers on compute-optimized instances, putting your web frontend on on-demand nodes while background jobs run on cheaper spot instances, or isolating sensitive workloads onto dedicated node pools.

Custom Node Groups

You define custom node groups at the Rack level using the additional_node_groups_config parameter. Create a JSON file with your node group definitions:

[
  {
    "id": 101,
    "type": "t3.medium",
    "capacity_type": "ON_DEMAND",
    "min_size": 1,
    "max_size": 5,
    "label": "web-services",
    "tags": "environment=production,team=frontend"
  },
  {
    "id": 102,
    "type": "c5.large",
    "capacity_type": "SPOT",
    "min_size": 0,
    "max_size": 10,
    "label": "batch-workers",
    "tags": "environment=production,team=data"
  }
]

Then apply it to your Rack:

$ convox rack params set additional_node_groups_config=/path/to/node-groups.json -r rackName

Each node group gets a label value that becomes a Kubernetes label (convox.io/label) on those nodes. You can also assign a unique id to each group so it doesn't get recreated during config updates, and use the tags field to apply AWS resource tags for cost tracking.

Targeting Node Groups from Your Services

Once your node groups are in place, use nodeSelectorLabels in your convox.yml to direct services to specific groups:

services:
  web:
    build: .
    port: 3000
    nodeSelectorLabels:
      convox.io/label: web-services
  worker:
    build: ./worker
    nodeSelectorLabels:
      convox.io/label: batch-workers

If you want softer placement preferences rather than hard requirements, you can use nodeAffinityLabels with weights. This tells Kubernetes to prefer certain nodes without failing the deployment if they're not available:

services:
  web:
    nodeSelectorLabels:
      convox.io/label: web-services
    nodeAffinityLabels:
      - weight: 10
        label: node.kubernetes.io/instance-type
        value: t3a.large
      - weight: 1
        label: node.kubernetes.io/instance-type
        value: t3a.medium

In this example, the service will always run on the web-services node group, but Kubernetes will prefer t3a.large instances within that group (weight 10) over t3a.medium instances (weight 1).

Dedicated Build Nodes

Workload placement isn't just for services. You can also create dedicated node groups specifically for builds using additional_build_groups_config, and then target them with the BuildLabels app parameter:

$ convox rack params set additional_build_groups_config='[{"id":201,"type":"c5.xlarge","capacity_type":"SPOT","min_size":0,"max_size":3,"label":"app-build","disk":100}]' -r rackName
$ convox apps params set BuildLabels=convox.io/label=app-build -a myapp

This puts your builds on beefy, cost-effective spot instances that scale to zero when idle, keeping them completely isolated from your production workloads.

For the full set of node group configuration options (including dedicated node pools, AMI overrides, and disk sizing), see the Workload Placement documentation.

The Complete Migration Comparison

Here's a summary of every significant change between v2 and v3, consolidated for quick reference:

Area | v2 (ECS) | v3 (Kubernetes)
Network | EC2 in public subnets | Nodes in private subnets (NAT required)
Logs | Stream immediately | Hidden until readiness probe passes
Builds | Dedicated build instance | Main cluster (enable build_node_enabled)
Database Resources | type: postgres → RDS | type: postgres → Container. Use rds-postgres for RDS.
Timer Cron | 6-field (0 3 * * ? *) | 5-field (0 3 * * *)
Public Ports | port: 3000 | port: 3000 (same, but ports: now = internal only)
TCP/UDP | Via ports array | Requires balancers: block (NLB)
Memory/CPU | scale.memory = hard limit | scale.memory = request (scheduling guarantee). Limits optional.
Autoscaling | ECS service auto scaling | scale.count: 2-10 with targets creates HPA
Service Discovery | links: injects [SERVICE]_URL | K8s DNS: service.rackName-appName.svc.cluster.local
Termination | drain: 30 | termination: grace: 30
IAM | Instance-level or Kiam | EKS Pod Identity (pod_identity_agent_enable=true)
Existing DBs | Manual env var or Kiam | import option or Resource Overlay via env var
Workload Placement | Shared instance pool | Custom node groups with nodeSelectorLabels and nodeAffinityLabels

Make the Jump

Migrating from v2 to v3 takes some planning, but if you've followed along with this guide, you already have a clear picture of what needs to change and why. Most of the work is mechanical: updating resource types, adjusting cron syntax, updating termination syntax. Once you're through it, you get standard Kubernetes ecosystem compatibility, multi-cloud portability, startup probes, Pod Disruption Budgets, custom NLBs, granular IAM via Pod Identity, and the entire universe of Kubernetes tooling — all while keeping the convox deploy simplicity that made you choose Convox in the first place.

Start with a staging Rack, use the comparison table above, and give yourself a bit of time to get comfortable with the debugging workflow. The syntax changes are the easy part. The mental model shift — from ECS to Kubernetes — is what takes more effort, but this guide has given you the map.

Ready to make the jump? Sign up for a free Convox account and spin up a v3 staging Rack today. Walk through our Getting Started Guide to see how the new architecture works firsthand, or explore our example applications for reference implementations.

For teams planning a production migration that want help with architecture review, VPC planning, or cost optimization, reach out to our team — we've helped hundreds of teams navigate this transition successfully.

Resources

For enterprise migrations, architecture reviews, or compliance questions: sales@convox.com

Let your team focus on what matters.