# gpu-workload/t5/kubernetes/hpa.yaml
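# HorizontalPodAutoscaler for the t5-inference Deployment. It scales on a
# custom per-pod metric instead of CPU; this assumes a custom metrics adapter
# (e.g. the Custom Metrics Stackdriver Adapter backed by Google Cloud Managed
# Service for Prometheus) is installed so the HPA can read the metric.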
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: t5-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: t5-inference
  minReplicas: 1
  maxReplicas: 5
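  # Per-pod custom metric. The prometheus.googleapis.com|<metric>|<kind> name
  # is the form used when a Prometheus metric (here, TorchServe's
  # ts_queue_latency_microseconds counter) is exposed to the HPA via the
  # Custom Metrics Stackdriver Adapter (assumed to be deployed in the cluster).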
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|ts_queue_latency_microseconds|counter
      target:
        type: AverageValue
        averageValue: "7000000"
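        # Target is in microseconds: 7,000,000 us = 7 s. The HPA adds replicas
        # when the average value of this metric per pod exceeds the target.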