PMLE domain - 20% of the exam

Serving and Scaling Models

Serving and Scaling Models is 20% of the Google Cloud Professional Machine Learning Engineer (PMLE) exam. These are the objectives it covers, each with practice questions and worked explanations.

Objectives in this domain

Sample question from this domain

Free sample | Serving and Scaling Models | Medium

A team serves a fraud-scoring model on a Vertex AI dedicated endpoint. Traffic is near zero overnight but spikes sharply at 09:00 each weekday, and the first burst of requests times out before capacity catches up. They want the endpoint to absorb the morning spike without paying for idle replicas overnight. Which configuration change best meets both goals?

  • A. Set minReplicaCount to a higher fixed value equal to the peak so replicas are always provisioned for the spike.
  • B. Disable autoscaling and run a single large machine type so one replica handles all traffic at any hour.
  • C. Leave the defaults but add a client-side retry loop so timed-out morning requests are resubmitted until replicas are ready.
  • D. Keep a small minReplicaCount above zero, raise maxReplicaCount well above peak, and lower the target utilisation so scale-out triggers earlier. (Correct)

Tune Vertex AI endpoint autoscaling by combining a warm minimum floor, a high ceiling, and an earlier scale-out trigger to absorb spikes economically. Autoscaling reacts to a utilisation signal. A non-zero minimum keeps warm replicas ready to answer the first burst, a lower utilisation target makes the scaler add replicas before the request queue saturates, and a high maximum lets it reach peak capacity. Together they cover the spike without holding peak capacity overnight.
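
To see why a lower utilisation target triggers scale-out earlier, here is a minimal sketch of a target-tracking rule. It assumes the scaler aims for roughly ceil(current replicas × current utilisation / target utilisation), clamped between the minimum and maximum. That is a simplified model for illustration, not the exact Vertex AI algorithm. The function name and all numbers are hypothetical.

```python
import math


def desired_replicas(current, utilisation, target, min_replicas, max_replicas):
    """Simplified target-tracking rule (an assumption, not the exact service logic).

    Picks the replica count that would bring average utilisation back to the
    target, then clamps it to the configured floor and ceiling.
    """
    raw = math.ceil(current * utilisation / target)
    return max(min_replicas, min(max_replicas, raw))


# Two warm replicas at 55% CPU as the 09:00 burst begins (hypothetical values).
at_60 = desired_replicas(2, 55, target=60, min_replicas=2, max_replicas=20)
at_40 = desired_replicas(2, 55, target=40, min_replicas=2, max_replicas=20)
print(at_60)  # 2: with a 60% target the scaler does not add capacity yet
print(at_40)  # 3: with a 40% target it is already scaling out
```

At the same load, the lower target asks for an extra replica sooner, which is the head start that option D relies on.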

Why A is wrong: Pinning the floor to peak capacity does remove the cold-start delay, but it keeps every peak replica running through the idle overnight window, which directly contradicts the requirement to avoid paying for idle capacity.

Why B is wrong: A single oversized replica looks simpler and avoids scaling lag, but one replica cannot scale out for a sharp concurrent spike and becomes a throughput bottleneck during the 09:00 burst.

Why C is wrong: Client retries can mask occasional failures, but they add load to an already saturating endpoint during scale-out and do nothing to provision capacity earlier, so the morning timeouts persist.

Why D is correct: A small warm floor avoids the cold-start timeouts on the first burst, the high ceiling lets the endpoint scale out for the spike, and a lower utilisation target makes the autoscaler add replicas before the queue saturates.
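
As a rough illustration of option D, the sketch below builds the replica and autoscaling settings as a plain dictionary. The field names minReplicaCount and maxReplicaCount come from the question. The surrounding layout (machineSpec, autoscalingMetricSpecs, the CPU metric name) is an assumption about the endpoint's dedicated-resources request body and should be checked against the current API reference. The helper name, machine type, and numbers are hypothetical.

```python
# Assumed metric name for CPU-based autoscaling; verify before relying on it.
CPU_METRIC = "aiplatform.googleapis.com/prediction/online/cpu/utilization"


def dedicated_resources(machine_type, min_replicas, max_replicas, cpu_target):
    """Build a warm-floor, high-ceiling, early-trigger autoscaling config.

    Hypothetical helper: it only assembles and validates a dictionary and
    makes no API calls.
    """
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    if not 0 < cpu_target <= 100:
        raise ValueError("cpu_target is a percentage in (0, 100]")
    return {
        "machineSpec": {"machineType": machine_type},
        "minReplicaCount": min_replicas,   # small warm floor for the first burst
        "maxReplicaCount": max_replicas,   # ceiling well above expected peak
        "autoscalingMetricSpecs": [
            {"metricName": CPU_METRIC, "target": cpu_target}  # lower = earlier scale-out
        ],
    }


config = dedicated_resources("n1-standard-4", min_replicas=2, max_replicas=20, cpu_target=40)
print(config["minReplicaCount"], config["maxReplicaCount"])
```

The validation mirrors the reasoning in the answer: the floor stays above zero so warm replicas exist at 09:00, and the ceiling is set independently so the overnight cost is bounded by the floor alone.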

Other domains in this exam

See also the PMLE cert hub, the study guide, and the cheat sheet.

Examworthy is not affiliated with or endorsed by Google Cloud. Original, blueprint-aligned practice material only.