Objectives in this domain

Serve models for batch and online inference using Agent Platform, Model Garden, Cloud Run, and GKE, packaging models from frameworks such as PyTorch and XGBoost with prebuilt and custom containers, versioning in the Model Registry, implementing rollout strategies such as A/B testing and canary deployments, and handling inference pre- and postprocessing.
Section 4.1 (hard)
Scale online model serving by managing and serving features with the Feature Store, deploying to public and private endpoints, choosing CPU, GPU, TPU, and edge hardware, scaling the serving backend for throughput, and tuning models for production training and serving.
Section 4.2 (medium)

Sample question from this domain

Free sample: Serving and Scaling Models (medium)

A team serves a fraud-scoring model on a Vertex AI dedicated endpoint. Traffic is near zero overnight but spikes sharply at 09:00 each weekday, and the first burst of requests times out before capacity catches up. They want the endpoint to absorb the morning spike without paying for idle replicas overnight. Which configuration change best meets both goals?

A. Set minReplicaCount to a higher fixed value equal to the peak so replicas are always provisioned for the spike.
B. Disable autoscaling and run a single large machine type so one replica handles all traffic at any hour.
C. Leave the defaults but add a client-side retry loop so timed-out morning requests are resubmitted until replicas are ready.
D. Keep a small minReplicaCount above zero, raise maxReplicaCount well above peak, and lower the target utilisation so scale-out triggers earlier. (Correct)

Tune Vertex AI endpoint autoscaling by combining a warm minimum floor, a high ceiling, and an earlier scale-out trigger to absorb spikes economically. Autoscaling reacts to a utilisation signal. A non-zero minimum keeps warm replicas that answer the first burst, a lower utilisation target makes the scaler add replicas before the request queue saturates, and a high maximum lets it reach peak. Together they cover the spike without holding peak capacity overnight.

Why A is wrong: Pinning the floor to peak capacity does remove the cold-start delay, but it keeps every peak replica running through the idle overnight window, which directly contradicts the requirement to avoid paying for idle capacity.

Why B is wrong: A single oversized replica looks simpler and avoids scaling lag, but one replica cannot scale out for a sharp concurrent spike and becomes a throughput bottleneck during the 09:00 burst.

Why C is wrong: Client retries can mask occasional failures, but they add load to an already saturating endpoint during scale-out and do nothing to provision capacity earlier, so the morning timeouts persist.

Why D is correct: A small warm floor avoids the cold-start timeouts on the first burst, the high ceiling lets the endpoint scale out for the spike, and a lower utilisation target makes the autoscaler add replicas before the queue saturates.
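The interaction between the floor, the ceiling, and the utilisation target can be illustrated with a toy model. The sketch below assumes a generic target-tracking rule (resize the fleet so that average utilisation returns to the target, then clamp to the configured bounds). It is a simplification for intuition only, not Vertex AI's actual autoscaling algorithm, and the function name and the numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas, utilisation_pct, target_pct,
                     min_replicas, max_replicas):
    """Toy target-tracking rule (an assumption, not Vertex AI's real algorithm).

    Sizes the fleet so that average utilisation would fall back to the
    target, then clamps the result to the [min_replicas, max_replicas] range.
    """
    raw = math.ceil(current_replicas * utilisation_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))


# Two warm replicas are running at 55% utilisation as the morning ramp begins.
# With a 60% target the scaler does nothing yet; with a 40% target it adds one.
late_trigger = desired_replicas(2, 55, 60, min_replicas=2, max_replicas=20)
early_trigger = desired_replicas(2, 55, 40, min_replicas=2, max_replicas=20)

# At 95% utilisation a low ceiling caps the scale-out below what is needed.
capped = desired_replicas(2, 95, 40, min_replicas=2, max_replicas=4)
uncapped = desired_replicas(2, 95, 40, min_replicas=2, max_replicas=20)

# Overnight, near-zero load falls back to the small warm floor, not to peak.
overnight = desired_replicas(5, 2, 40, min_replicas=2, max_replicas=20)

print(late_trigger, early_trigger, capped, uncapped, overnight)
```

In this model the lower target is what moves the scale-out earlier, the ceiling only matters once the spike is large, and the floor determines the overnight cost. In the Vertex AI SDK for Python, the corresponding knobs are, to the best of our knowledge, the `min_replica_count`, `max_replica_count`, and `autoscaling_target_cpu_utilization` arguments of `Model.deploy`; check the current SDK reference before relying on them.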

Other domains in this exam

Architecting Low-Code AI Solutions: 12% of the exam
Collaborating Within and Across Teams to Manage Data and Models: 16% of the exam
Scaling Prototypes Into ML Models: 21% of the exam
Automating and Orchestrating ML Pipelines: 18% of the exam
Monitoring AI Solutions: 13% of the exam

See also the PMLE cert hub, the study guide, and the cheat sheet.