Kubernetes Autoscaling: HPA and VPA in Practice
Setting up Horizontal and Vertical Pod Autoscaling without blowing up your cluster. Resource requests, metrics, and the pitfalls of scaling.
Jean-Pierre Broeders
Freelance DevOps Engineer
Manually bumping replica counts during traffic spikes works, until the one spike nobody notices before the alerts fire. Or worse: nobody scales back down and the cloud bill doubles. Kubernetes ships with two autoscaling mechanisms: the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). They solve different problems, and they work in fundamentally different ways.
HPA: more pods under more load
The HPA is the more widely known option. The concept is straightforward: when CPU or memory crosses a threshold, new pods spin up. When load drops, they get removed.
A basic HPA configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
That behavior block matters more than most people realize. The default scaleDown policy allows removing 100% of the surplus pods in a single step, so a brief traffic dip can wipe out half the deployment at once. An explicit policy of one pod per 60 seconds makes downscaling gradual, and the 300-second stabilization window ensures a dip has to persist before any scale-down starts at all.
Resource requests: get the basics right
An HPA that scales on CPU utilization measures against the resource request: 70% utilization means 70% of the requested CPU, not of the node. Without requests defined, the HPA reports its target as `<unknown>` and never scales. This is by far the most common mistake.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
Those requests need to be realistic. Set them too low and the HPA triggers constantly. Set them too high and scaling never kicks in despite actual CPU pressure. A solid approach: run the application for a week with kubectl top pods and observe real consumption. Set the request at P50 usage, the limit at P95.
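Turning a week of samples into concrete numbers doesn't need more than standard tools. A sketch: assuming CPU readings in millicores were appended to a file over time (the sample values below are invented for illustration), a sort-and-awk pass picks out nearest-rank P50 and P95.

```shell
# Hypothetical sample of CPU readings in millicores, e.g. collected periodically with:
#   kubectl top pods -n production --no-headers | awk '{print $2}' | tr -d m >> cpu_samples.txt
printf '%s\n' 180 210 240 260 300 320 410 450 520 610 > cpu_samples.txt

# Nearest-rank percentiles over the sorted samples
result=$(sort -n cpu_samples.txt | awk '
  { v[NR] = $1 }
  END {
    p50 = v[int(NR * 0.50 + 0.5)]   # request candidate
    p95 = v[int(NR * 0.95 + 0.5)]   # limit candidate
    printf "request: %dm limit: %dm", p50, p95
  }')
echo "$result"
```

For this sample the sketch suggests a 300m request and a 610m limit; with real data, round to values that make the math easy to reason about.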
Custom metrics: beyond CPU
CPU is a blunt instrument. For an API server, requests per second is a far better scaling signal. With the Prometheus adapter, the HPA can scale on custom metrics:
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100
This requires the application to export metrics and the Prometheus adapter to be running. It is more setup work, but the gain in scaling accuracy is significant. CPU-based scaling is reactive: CPU only rises once requests are already piling up. Request-based scaling responds as soon as traffic does.
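How the adapter turns a raw Prometheus counter into that per-second pods metric is defined in the adapter's rule configuration. A sketch under assumptions: the application exposes a counter named http_requests_total carrying namespace and pod labels (both names are placeholders for whatever the app actually exports).

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    # expose the counter under the name the HPA references
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    # convert the raw counter into a per-second rate
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The 2-minute rate window is a trade-off: shorter reacts faster but is noisier, longer smooths out bursts at the cost of lag.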
VPA: bigger pods instead of more pods
Not every workload scales horizontally. A database, a cache layer, a monolith that isn't stateless — adding more replicas doesn't help. That's where the Vertical Pod Autoscaler comes in. Instead of adding pods, the VPA makes existing ones larger.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: background-worker
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: worker
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi
One critical detail: VPA in Auto mode restarts pods to apply the new resource limits. That means brief downtime unless multiple replicas are running. In many cases, updateMode: "Off" is the safer choice — the VPA only provides recommendations that can be applied manually.
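The recommendation-only variant is the same manifest with the update mode switched off:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: background-worker
  updatePolicy:
    updateMode: "Off"  # compute recommendations, never restart pods
```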
To check the recommendations:
kubectl describe vpa worker-vpa
This shows lower bound, target, and upper bound for CPU and memory. Useful as a starting point for fine-tuning resource requests.
Combining HPA and VPA: does it work?
Short answer: yes, but not on the same metric. If the HPA scales on CPU and the VPA also adjusts CPU, a feedback loop emerges that goes nowhere useful. The solution:
- HPA scales on custom metrics (requests per second, queue depth)
- VPA manages CPU and memory requests
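A queue-depth signal sidesteps the CPU overlap entirely. A sketch assuming an external metrics provider exposes a rabbitmq_queue_messages_ready gauge (the metric name and the queue label are placeholders for whatever the provider actually serves):

```yaml
metrics:
  - type: External
    external:
      metric:
        name: rabbitmq_queue_messages_ready
        selector:
          matchLabels:
            queue: jobs
      target:
        type: AverageValue
        averageValue: "30"  # aim for roughly 30 waiting messages per pod
```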
Alternatively, the Multidimensional Pod Autoscaler (MPA) combines both strategies without conflicts, if available in the cluster.
Common mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| No resource requests | HPA doesn't function | Always set requests based on actual consumption |
| minReplicas set to 1 | Single point of failure at low traffic | Minimum of 2 for production |
| No scaleDown stabilization | Flapping: pods constantly starting and stopping | stabilizationWindowSeconds of 300+ |
| VPA on Auto without replicas | Downtime on every adjustment | Use "Off" mode or ensure multiple replicas |
| Limits far above requests | Node overcommitment, OOM kills | Keep limits close to requests (max 2x) |
Monitoring your autoscaler
Without monitoring, autoscaling is a black box. A few essential checks:
# Current status of all HPAs
kubectl get hpa -A
# Detailed view with events
kubectl describe hpa api-hpa -n production
# Resource usage per pod
kubectl top pods -n production --sort-by=cpu
Set up alerts for the situation where an HPA sits at maxReplicas. That means the application wants more capacity than the autoscaler is allowed to provide: time to raise maxReplicas, add cluster capacity, or optimize the application itself.
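With the Prometheus Operator and kube-state-metrics in place, that alert can be written as a rule. A sketch; the metric names come from kube-state-metrics, and the 15-minute hold-off is an assumption to tune:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-alerts
  namespace: production
spec:
  groups:
    - name: autoscaling
      rules:
        - alert: HPAMaxedOut
          # fires when an HPA has been pinned at its maximum for 15 minutes
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas
              == kube_horizontalpodautoscaler_spec_max_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} is at maxReplicas"
```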
When autoscaling is overkill
Not everything needs to scale dynamically. An internal tool with five users? Just run two fixed replicas. A cronjob that fires once an hour? Fixed resources. Autoscaling adds complexity that needs to earn its place.
The sweet spot: start with fixed replicas and well-tuned resource requests. Measure actual consumption over a few weeks. Then add autoscaling, using that data as the basis for thresholds. Evidence-based scaling beats guessing every time.
