A team serves a fraud-scoring model on a Vertex AI dedicated endpoint. Traffic is near zero overnight but spikes sharply at 09:00 each weekday, and the first burst of requests times out before capacity catches up. They want the endpoint to absorb the morning spike without paying for idle replicas overnight. Which configuration change best meets both goals?
- A. Set minReplicaCount to a higher fixed value equal to the peak so replicas are always provisioned for the spike.
- B. Disable autoscaling and run a single large machine type so one replica handles all traffic at any hour.
- C. Leave the defaults but add a client-side retry loop so timed-out morning requests are resubmitted until replicas are ready.
- D. Keep a small minReplicaCount above zero, raise maxReplicaCount well above peak, and lower the target utilisation so scale-out triggers earlier. (Correct)
Why A is wrong: Pinning the floor to peak capacity does remove the cold-start delay, but it keeps every peak replica running through the idle overnight window, which directly contradicts the requirement to avoid paying for idle capacity.
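Why B is wrong: A single oversized replica looks simpler and avoids scaling lag, but with autoscaling disabled the endpoint cannot add replicas for a sharp concurrent spike, so that one machine becomes the throughput bottleneck during the 09:00 burst. It also runs at full size, and is billed for it, through the idle overnight window.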
Why C is wrong: Client retries can mask occasional failures, but they add load to an already saturating endpoint during scale-out and do nothing to provision capacity earlier, so the morning timeouts persist.
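Why D is correct: A small warm floor avoids the cold-start timeouts on the first burst, the high ceiling lets the endpoint scale out for the spike, and a lower utilisation target makes the autoscaler add replicas before the queue saturates. The floor still costs something overnight, but only for a few replicas rather than for peak capacity, which makes it the best balance of the two goals among the options.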
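To make the effect of the lower utilisation target concrete, here is a minimal sketch. It assumes the autoscaler follows the proportional rule ceil(current replicas × current utilisation / target utilisation), clamped between the floor and the ceiling. The replica counts, the 55% reading, and the 40% target are hypothetical illustration values, and the `dedicated_resources` dictionary only mirrors the field names as they are understood to appear in the endpoint's deployment configuration; treat it as a sketch, not a verified request body.

```python
import math


def desired_replicas(current, utilization, target, min_replicas, max_replicas):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas].

    Assumption: the autoscaler aims for
    ceil(current * utilization / target) replicas.
    """
    raw = math.ceil(current * utilization / target)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical settings matching option D: small warm floor, high ceiling,
# lowered CPU utilisation target (the default is understood to be 60).
dedicated_resources = {
    "minReplicaCount": 2,
    "maxReplicaCount": 20,
    "autoscalingMetricSpecs": [
        {
            "metricName": "aiplatform.googleapis.com/prediction/online/cpu/utilization",
            "target": 40,
        }
    ],
}

floor = dedicated_resources["minReplicaCount"]
ceiling = dedicated_resources["maxReplicaCount"]

# Two warm replicas reading 55% CPU as the 09:00 spike begins (hypothetical).
at_default_target = desired_replicas(2, 55, 60, floor, ceiling)
at_lowered_target = desired_replicas(2, 55, 40, floor, ceiling)

print(at_default_target, at_lowered_target)
```

Under these assumptions the 60% target leaves the endpoint at its two warm replicas, because 55% is still below the target, while the 40% target already asks for a third replica. That earlier trigger is what gives new replicas time to start before the queue saturates.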