Kubernetes Autoscaling: HPA and VPA in Practice
Setting up Horizontal and Vertical Pod Autoscaling without blowing up your cluster. Resource requests, metrics, and the pitfalls of scaling.
Jean-Pierre Broeders
Freelance DevOps Engineer
Manually bumping replica counts during traffic spikes works, until the one spike nobody notices before the alerts fire. Or worse: nobody scales back down and the cloud bill doubles. Kubernetes ships with two autoscaling mechanisms: the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). They solve different problems, and they work in fundamentally different ways.
HPA: more pods under more load
The HPA is the more widely known option. The concept is straightforward: when CPU or memory crosses a threshold, new pods spin up. When load drops, they get removed.
A basic HPA configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
That behavior block matters more than most people realize. The default scaleDown policy allows removing 100% of the surplus pods in a single step, so a brief traffic dip can wipe out half the deployment at once. An explicit policy of one pod per 60 seconds makes downscaling gradual, and the 300-second stabilization window ensures a dip has to persist before any scale-down starts at all.
Resource requests: get the basics right
An HPA that scales on CPU utilization measures against the resource request: 70% utilization means 70% of the requested CPU, not of the node. Without requests defined, the HPA reports its target as `<unknown>` and never scales. This is by far the most common mistake.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
Those requests need to be realistic. Set them too low and the HPA triggers constantly. Set them too high and scaling never kicks in despite actual CPU pressure. A solid approach: run the application for a week with kubectl top pods and observe real consumption. Set the request at P50 usage, the limit at P95.
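Turning a week of samples into concrete numbers doesn't need more than standard tools. A sketch: assuming CPU readings in millicores were appended to a file over time (the sample values below are invented for illustration), a sort-and-awk pass picks out nearest-rank P50 and P95.

```shell
# Hypothetical sample of CPU readings in millicores, e.g. collected periodically with:
#   kubectl top pods -n production --no-headers | awk '{print $2}' | tr -d m >> cpu_samples.txt
printf '%s\n' 180 210 240 260 300 320 410 450 520 610 > cpu_samples.txt

# Nearest-rank percentiles over the sorted samples
result=$(sort -n cpu_samples.txt | awk '
  { v[NR] = $1 }
  END {
    p50 = v[int(NR * 0.50 + 0.5)]   # request candidate
    p95 = v[int(NR * 0.95 + 0.5)]   # limit candidate
    printf "request: %dm limit: %dm", p50, p95
  }')
echo "$result"
```

For this sample the sketch suggests a 300m request and a 610m limit; with real data, round to values that make the math easy to reason about.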
Custom metrics: beyond CPU
CPU is a blunt instrument. For an API server, requests per second is a far better scaling signal. With the Prometheus adapter, the HPA can scale on custom metrics:
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100
This requires the application to export metrics and the Prometheus adapter to be running. It is more setup work, but the gain in scaling accuracy is significant. CPU-based scaling is reactive: CPU only rises once requests are already piling up. Request-based scaling responds as soon as traffic does.
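How the adapter turns a raw Prometheus counter into that per-second pods metric is defined in the adapter's rule configuration. A sketch under assumptions: the application exposes a counter named http_requests_total carrying namespace and pod labels (both names are placeholders for whatever the app actually exports).

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    # expose the counter under the name the HPA references
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    # convert the raw counter into a per-second rate
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The 2-minute rate window is a trade-off: shorter reacts faster but is noisier, longer smooths out bursts at the cost of lag.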
VPA: bigger pods instead of more pods
Not every workload scales horizontally. A database, a cache layer, a monolith that isn't stateless — adding more replicas doesn't help. That's where the Vertical Pod Autoscaler comes in. Instead of adding pods, the VPA makes existing ones larger.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: background-worker
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: worker
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi
One critical detail: VPA in Auto mode restarts pods to apply the new resource limits. That means brief downtime unless multiple replicas are running. In many cases, updateMode: "Off" is the safer choice — the VPA only provides recommendations that can be applied manually.
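The recommendation-only variant is the same manifest with the update mode switched off:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: background-worker
  updatePolicy:
    updateMode: "Off"  # compute recommendations, never restart pods
```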
To check the recommendations:
kubectl describe vpa worker-vpa
This shows lower bound, target, and upper bound for CPU and memory. Useful as a starting point for fine-tuning resource requests.
Combining HPA and VPA: does it work?
Short answer: yes, but not on the same metric. If the HPA scales on CPU and the VPA also adjusts CPU, a feedback loop emerges that goes nowhere useful. The solution:
- HPA scales on custom metrics (requests per second, queue depth)
- VPA manages CPU and memory requests
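A queue-depth signal sidesteps the CPU overlap entirely. A sketch assuming an external metrics provider exposes a rabbitmq_queue_messages_ready gauge (the metric name and the queue label are placeholders for whatever the provider actually serves):

```yaml
metrics:
  - type: External
    external:
      metric:
        name: rabbitmq_queue_messages_ready
        selector:
          matchLabels:
            queue: jobs
      target:
        type: AverageValue
        averageValue: "30"  # aim for roughly 30 waiting messages per pod
```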
Alternatively, the Multidimensional Pod Autoscaler (MPA) combines both strategies without conflicts, if available in the cluster.
Common mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| No resource requests | HPA doesn't function | Always set requests based on actual consumption |
| minReplicas set to 1 | Single point of failure at low traffic | Minimum of 2 for production |
| No scaleDown stabilization | Flapping: pods constantly starting and stopping | stabilizationWindowSeconds of 300+ |
| VPA on Auto without replicas | Downtime on every adjustment | Use "Off" mode or ensure multiple replicas |
| Limits far above requests | Node overcommitment, OOM kills | Keep limits close to requests (max 2x) |
Monitoring your autoscaler
Without monitoring, autoscaling is a black box. A few essential checks:
# Current status of all HPAs
kubectl get hpa -A
# Detailed view with events
kubectl describe hpa api-hpa -n production
# Resource usage per pod
kubectl top pods -n production --sort-by=cpu
Set up alerts for the situation where an HPA sits at maxReplicas. That means the application wants more capacity than the autoscaler is allowed to provide: time to raise maxReplicas, add cluster capacity, or optimize the application itself.
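With the Prometheus Operator and kube-state-metrics in place, that alert can be written as a rule. A sketch; the metric names come from kube-state-metrics, and the 15-minute hold-off is an assumption to tune:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-alerts
  namespace: production
spec:
  groups:
    - name: autoscaling
      rules:
        - alert: HPAMaxedOut
          # fires when an HPA has been pinned at its maximum for 15 minutes
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas
              == kube_horizontalpodautoscaler_spec_max_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} is at maxReplicas"
```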
When autoscaling is overkill
Not everything needs to scale dynamically. An internal tool with five users? Just run two fixed replicas. A cronjob that fires once an hour? Fixed resources. Autoscaling adds complexity that needs to earn its place.
The sweet spot: start with fixed replicas and well-tuned resource requests. Measure actual consumption over a few weeks. Then add autoscaling, using that data as the basis for thresholds. Evidence-based scaling beats guessing every time.
