Google Cloud study guide

How to pass Google Cloud Professional Machine Learning Engineer (PMLE)

23 min read6 domains coveredFree practice, no sign-up

The Google Cloud Professional Machine Learning Engineer (PMLE) tests whether you can take an ML or generative AI problem from a rough prototype to a monitored production system, and choose the right Google Cloud product at every step. Google hands you a business scenario with constraints on cost, latency, scale, team skill set, and how much operational work the team can absorb, then asks which approach fits. The hard part is rarely knowing what a service does; it is knowing which one wins when BigQuery ML, AutoML, a custom Vertex AI training job, and a tuned foundation model could all plausibly solve the task and only one matches every constraint named in the scenario.

It suits practitioners who already build and ship ML: ML engineers, data scientists moving models into production, and software engineers who own model serving and pipelines on Google Cloud. The exam spans six domains that run the full lifecycle, from low-code AutoML and foundation-model solutions, through data and experiment collaboration, scaling and training, serving, pipeline automation, and finally monitoring and responsible AI. There is no formal prerequisite, but the questions assume real exposure to Vertex AI, BigQuery ML, the accelerator choices, and the trade-offs between them.

The exam rewards decision rules, not feature recall. Most questions are short scenarios where two or three answers are technically capable and only one is the best fit once you weigh the constraint that was named: the lowest cost, the lowest latency, the least code to maintain, a SQL-only team, or a model that no longer fits on one accelerator. The current blueprint also leans hard into generative AI, so context caching, Gemini fine-tuning from BigQuery, the Gen AI evaluation service, and Model Armor now sit alongside the classic train-serve-monitor material. Practising on scenario questions with a worked explanation, and a reason every wrong option is wrong, beats memorising service datasheets.

PMLE is a pick-the-right-approach exam across the whole ML lifecycle: nearly every question is a scenario with cost, latency, skill, and scale constraints, and the right answer is the Google Cloud service or setting that fits them, usually the most managed option that meets the requirement.

Difficulty

Advanced

Best for

Working ML practitioners: ML engineers, data scientists productionising models, and software engineers who own serving, pipelines, and monitoring on Google Cloud, who need to prove they can take a model from prototype to monitored production and choose the right Vertex AI and BigQuery ML tooling under real constraints.

Prerequisites

None enforced. Google recommends around three years of industry experience including a year or more designing and managing ML solutions on Google Cloud. Hands-on exposure to Vertex AI, BigQuery ML, foundation-model tuning, and the accelerator and serving choices is what actually carries you.

50 to 60

Questions

120 min

Time allowed

$200

Exam cost (USD)

230

Practice questions

How this exam thinks

One habit decides this exam: read the scenario for its constraint, then pick the approach that fits it. Almost every question is a short business situation with a stated limit on cost, latency, scale, team skill, or operational overhead, and the answer is the Google Cloud product, model type, or setting that meets that limit. Several options will be technically capable. Only one is the best fit once you weigh what the scenario actually asked for.

The default tie-breaker is the most managed option that meets the requirement. Google designs the exam around its own preference for managed services, so when two answers both work, the one with less to build and run usually wins: BigQuery ML for a SQL-fluent team whose data already lives in the warehouse, AutoML over a hand-built network when there is no reason to go custom, a tuned foundation model over training from scratch, batch prediction over standing up an always-on endpoint for a nightly bulk job. Reach for the lower-level option only when the scenario names a reason, such as a custom architecture, an existing framework, or a model too large for one accelerator. That named reason is the signal that the obvious managed answer is the trap.

The rest is a handful of discriminations the exam leans on, each driven by the constraint in the scenario. Model parallelism when the model will not fit on one device, data parallelism when it fits but training is slow; online endpoints for low-latency single requests, batch prediction for high-throughput bulk scoring; Experiments to compare ad hoc runs, Pipelines to orchestrate a repeatable workflow; performance or drift-threshold retraining when degradation is unpredictable, a schedule only when it is periodic. Name the constraint, then choose the service or setting built for it.

What each domain tests and how to study it

The PMLE blueprint is split across 6 domains. Weights are the official share of the exam; see the official exam guide for the authoritative breakdown.

Architecting Low-Code AI Solutions
12% of exam
What you must be able to do. Given a business problem and a team's skill level, choose the low-code or foundation-model approach that solves it at the lowest cost and effort, and configure Gemini, BigQuery ML, and the Gen AI evaluation service to fit.
In one sentenceThe low-code layer: solving ML and generative AI tasks with BigQuery ML, AutoML, and managed AI APIs before anyone writes custom training code.
Recall check: answer these from memory first
- A SQL-fluent team wants Gemini to learn their house style without leaving BigQuery and serve rewrites from a query. Which two BigQuery ML statements do this, and why is an exported custom training job wrong?
- Which Gemini request setting cuts both cost and latency when the same long instruction prefix repeats on every call, and why do temperature or maxOutputTokens changes not?
- You have shortlisted three Model Garden foundation models for a triage feature. How do you get a comparable, quantitative score per candidate on your own labelled data before deploying any of them?
What it tests. Solving problems without building models from scratch. Developing models with BigQuery ML and AutoML for classification, regression, forecasting, and clustering, including feature engineering and fine-tuning Gemini directly from BigQuery; and building AI solutions with Google Cloud AI APIs and foundation models, selecting from Model Garden, using prebuilt APIs such as Document AI, Vision, and Translate, tuning Gemini, Imagen, and Veo, and optimising generative calls for cost, latency, and availability with techniques such as context caching.
How to study it. Drill the low-code decision: when a SQL-fluent team's data already lives in BigQuery, the answer is usually BigQuery ML, including supervised fine-tuning of a Gemini remote model with CREATE MODEL and serving with ML.GENERATE_TEXT, never an exported custom training job. Learn the generative-AI cost and latency levers as named answers: context caching for a long repeated prompt prefix, model selection from Model Garden by task fit, and the Gen AI evaluation service to compare candidates on your own labelled data before deploying. Practise reading a scenario and naming the single managed technique that fits the team and the constraint.
Easy to confuse
- Fine-tuning Gemini from BigQuery versus training a BigQuery ML classifier. A remote Gemini model with supervised fine-tuning learns generative or labelling conventions from message-and-label pairs and keeps the model's language ability; a BigQuery ML logistic regression on bag-of-words is a separate classical classifier that discards the foundation model. Choose fine-tuning when the scenario wants the team's conventions inside a Gemini call.
- Context caching versus raising maxOutputTokens. Context caching persists a repeated input prefix server-side so its tokens are billed at a reduced rate and not recomputed, cutting input cost and time-to-first-token; maxOutputTokens only caps how much is generated and does nothing about a repeated prompt prefix. The constraint that decides it is a long shared prefix, not output length.
Worked example from the PMLE bank
Free sampleArchitecting Low-Code AI Solutionsmedium
A marketing analytics team holds 8,000 historical product blurbs in BigQuery, each paired with an approved on-brand rewrite that follows their house style for tone and length. A prompt-only Gemini call in BigQuery produces rewrites that are accurate but consistently miss the house tone. The team is fluent in SQL, wants the model to internalise their style without leaving BigQuery, and wants future rewrites generated directly from a SQL query. Which approach should they take?
- AKeep the base Gemini model and craft a longer ML.GENERATE_TEXT prompt that embeds a dozen example rewrites inline, so the model copies the tone from the few-shot examples on every call.
- BRun supervised fine-tuning of a Gemini model in BigQuery with CREATE MODEL using the paired blurbs and rewrites as the training table, then call the tuned model with ML.GENERATE_TEXT. Correct
- CTrain a BigQuery ML logistic regression model on the paired text columns so it learns to map each blurb to its rewritten form during training.
- DExport the paired blurbs to Cloud Storage and run a custom Vertex AI training job to fine-tune Gemini, then import the result back into BigQuery for inference.
BigQuery ML can supervised-fine-tune a Gemini model from a labelled table using CREATE MODEL, then serve generations with ML.GENERATE_TEXT without leaving the warehouse. Supervised fine-tuning adjusts the Gemini model's weights from paired prompt-response examples so it reproduces the house tone reliably, and BigQuery ML exposes this through CREATE MODEL and ML.GENERATE_TEXT so the SQL-fluent team never moves the data.
Why A is wrong: Tempting because few-shot prompting can nudge tone without training, but it does not internalise style from 8,000 examples, repeats the long prompt on every call raising cost, and is brittle compared with fine-tuning when consistent style is the goal.
Why B is correct: Correct: BigQuery ML supports supervised fine-tuning of Gemini via CREATE MODEL over a labelled prompt-and-response table, and the tuned endpoint is then queried with ML.GENERATE_TEXT, all inside BigQuery, so the model learns the house style from the paired examples.
Why C is wrong: Tempting because logistic regression is a familiar BigQuery ML model, but it is a classifier that predicts discrete labels and cannot generate free-form rewritten text, so it cannot produce styled blurbs at all.
Why D is wrong: Tempting because custom training can fine-tune models, but it leaves BigQuery, adds pipeline overhead the SQL-only team wants to avoid, and is unnecessary because BigQuery ML can fine-tune Gemini in place for this task.
Collaborating Within and Across Teams to Manage Data and Models
16% of exam
What you must be able to do. Given a preprocessing, prototyping, or experiment-tracking need, choose the right tool for the data scale and the stage of work, protect PII, and evaluate models with the metric that fits the data.
In one sentenceThe collaboration layer: preprocessing data at the right scale, prototyping safely in notebooks, and tracking experiments and lineage so a team can reproduce its work.
Recall check: answer these from memory first
- A 4 TB bounded join with window functions, a SQL team, and no desire to run a cluster: which preprocessing engine, and why not Dataflow or Spark on Dataproc?
- Positives are 2 percent of rows. Which metric ranks candidate models faithfully on the rare class, and why does overall accuracy mislead here?
- Distinguish an Experiment from Vertex ML Metadata in one line each: which one compares runs, and which one durably stores artifacts, executions, and lineage?
What it tests. Working as a team across data and models. Exploring and preprocessing tabular, text, and image data by choosing the right scale tool among BigQuery, Dataflow, Apache Spark, and in-memory Python, consolidating features in the Feature Store, and protecting personally identifiable information; prototyping in Vertex AI Workbench and Colab Enterprise with PyTorch, scikit-learn, and JAX; and tracking experiments and runs, choosing among Experiments, Vertex AI Pipelines, and Kubeflow Pipelines, evaluating predictive and generative solutions including LLM-as-a-judge, and tracking artifacts, versions, and lineage with ML Metadata.
How to study it. Build a scale-to-tool map for preprocessing and drill it: large bounded SQL joins and window aggregations with no cluster to manage go to BigQuery; streaming or complex Beam transforms go to Dataflow; an existing open-source estate goes to Spark on Dataproc; small in-memory work goes to pandas. Learn the experiment-tracking split: Experiments compares ad hoc runs by their parameters and metrics, Pipelines orchestrates repeatable multi-step workflows, and ML Metadata is the persistent store of artifacts, executions, and lineage beneath both. Lock the evaluation metrics, especially that imbalanced data needs precision-recall focused measures, not accuracy.
Easy to confuse
- Experiments versus Vertex ML Metadata. An Experiment groups related runs so their parameters and metrics can be compared for analysis; ML Metadata is the backing store that durably records artifacts, executions, and the lineage edges between them. The Experiment is the comparison view, ML Metadata is the persistent graph beneath it, so they complement rather than duplicate each other.
- Accuracy versus area under the precision-recall curve on imbalanced data. Accuracy is dominated by correct majority predictions, so a degenerate model that always predicts the common class scores high; area under the precision-recall curve derives from positive-class outcomes and measures minority detection across thresholds. When the scenario names a rare positive class, accuracy is the trap and the precision-recall measure is the answer.
Worked example from the PMLE bank
Free sampleCollaborating Within and Across Teams to Manage Data and Modelsmedium
A medical-screening team trains a classifier on a dataset where positive cases make up 2 percent of rows. During model selection they currently rank candidate models by overall accuracy, and a reviewer points out that this metric is misleading on such skewed data because a trivial model that predicts the negative class scores well. Which evaluation choice gives the team a more faithful comparison of how well each model identifies the rare positive cases?
- AContinue ranking by accuracy but raise the acceptance threshold so only very confident predictions count.
- BRank models by training-set log loss measured before any validation split is taken.
- CRank models by total prediction count to confirm each model produces an output for every row.
- DRank models by the area under the precision-recall curve, which focuses on performance on the rare positive class across thresholds. Correct
On highly imbalanced data, evaluate models with precision-recall focused metrics rather than overall accuracy, which is inflated by the majority class. When positives are rare, accuracy is dominated by correct majority predictions, so a degenerate negative-only model scores high. Precision and recall derive from positive-class outcomes, so the area under the precision-recall curve measures minority-class detection across thresholds.
Why A is wrong: Tweaking the threshold does change the operating point, which makes it sound responsive to imbalance, but accuracy stays dominated by the majority class regardless of threshold, so the ranking remains misleading on rare positives.
Why B is wrong: Log loss does penalise confident mistakes, which makes it look like a richer metric, but measuring it on the training set before splitting reports fit to seen data, not generalisation, so it cannot fairly compare models on the rare class.
Why C is wrong: Confirming an output per row is a reasonable sanity check, which is why it might be picked, but prediction count measures coverage, not correctness, and says nothing about how well the rare positives are identified.
Why D is correct: Precision and recall are computed from positive-class outcomes, so the precision-recall curve directly reflects how well each model catches rare cases without rewarding majority-class guessing, giving a faithful comparison on skewed data.
Scaling Prototypes Into ML Models
21% of exam
What you must be able to do. Given cost, complexity, latency, and scale constraints, choose the model type, the Google Cloud training product, and the accelerator and distribution strategy that fit, and diagnose training that does not fit or does not scale.
In one sentenceThe heaviest domain: turning a prototype into a trained model by picking the model type, the training product, and the right hardware and parallelism strategy.
Recall check: answer these from memory first
- A model's parameters and activations no longer fit on one accelerator even at batch size one. Which distribution strategy directly fixes this, and why does adding more data-parallel replicas not?
- A SQL team needs a univariate nightly forecast for 2,000 stores, results staying in BigQuery, small budget. Which model type and product, and why not a single DNN or an LLM prompt?
- Why does a TPU pod often sustain higher scaling efficiency than an equivalent GPU cluster on a synchronous data-parallel workload?
What it tests. Scaling a working idea into a trained model. Choosing the model type and product among ARIMA, DNN, and LLM and among AutoML, BigQuery ML, and Vertex AI Pipelines given cost, complexity, latency, and scalability, plus deployment and interpretability strategy; training by organising data on Cloud Storage and BigQuery, using custom training, Kubeflow on GKE, AutoML, and Tabular Workflows, troubleshooting failures, tuning hyperparameters, and fine-tuning foundation models; and choosing hardware among CPU, GPU, and TPU and applying data and model parallelism for distributed training.
How to study it. This is the biggest domain by weight, so spend the most time here. Fix the model-type-by-constraint calls until automatic: univariate forecasting for a SQL team that keeps results in the warehouse is ARIMA in BigQuery ML; many non-linearly interacting drivers point to boosted trees or a DNN over ARIMA. Lock the parallelism rule cold: model parallelism when the model exceeds one accelerator's memory, data parallelism when it fits but training is slow. Learn why a TPU pod scales synchronous training better than separate GPU hosts (its dedicated high-bandwidth inter-chip interconnect carries the gradient all-reduce), and practise the hardware choice from the workload.
Easy to confuse
- Model parallelism versus data parallelism. Model parallelism partitions a network's layers or tensors across devices so a model larger than one device's memory can still train; data parallelism replicates the whole model and splits the batch to train faster but does not raise the memory available to a single replica. The deciding constraint is whether the model fits on one device at all.
- ARIMA versus boosted trees or DNN for forecasting. ARIMA models a univariate series through its own autocorrelation and admits extra regressors only in a limited additive form; boosted trees and DNNs natively learn non-linear interactions among many exogenous drivers. Choose ARIMA for a single clean series, and a tree or neural learner when promotions, pricing, and weather interact non-linearly.
Worked example from the PMLE bank
Free sampleScaling Prototypes Into ML Modelsmedium
A team's model has grown so large that its parameters and intermediate activations no longer fit within a single accelerator's memory, even with a batch size of one. Which distributed training approach directly addresses this constraint?
- AModel parallelism, because partitioning the model's layers or tensors across devices lets a model larger than one device's memory still be trained. Correct
- BData parallelism, because replicating the model across more devices increases the total memory available to each replica.
- CIncreasing the global batch size, because larger batches spread the parameter memory more evenly across the training step.
- DSwitching from a GPU to a CPU host, because system RAM is larger and can therefore store the entire oversized model for training.
Recognise that model parallelism, not data parallelism, is needed when a model exceeds a single device's memory. Model parallelism partitions the network's tensors or layers across devices so each device stores and computes only a fragment, allowing a model whose memory footprint exceeds one accelerator to be trained across several.
Why A is correct: Model parallelism splits the parameters and activations across multiple devices so no single device must hold the whole model, which is exactly what is required when a model exceeds one accelerator's memory.
Why B is wrong: Adding replicas raises aggregate throughput and is the common first scaling step, which makes it tempting, but each replica still holds the entire model, so it does nothing to fit a model too large for one device.
Why C is wrong: A larger batch can improve hardware utilisation and sounds like it shares load, but batch size does not reduce parameter memory and in fact raises activation memory, worsening the constraint.
Why D is wrong: CPU hosts do have more RAM, which makes this seem like a memory fix, but training the model on a CPU would be prohibitively slow and abandons the accelerator throughput needed for large models.
Serving and Scaling Models
20% of exam
What you must be able to do. Given a serving workload's latency, throughput, and traffic shape, choose between online endpoints and batch prediction, configure autoscaling and hardware, and fix container and routing faults.
In one sentenceThe serving layer: choosing online versus batch, packaging the container correctly, and tuning autoscaling and hardware so production traffic is served within objective at a sensible cost.
Recall check: answer these from memory first
- A nightly bulk score of 400 million rows, no real-time requirement, cost and throughput matter. Which serving approach, and why is streaming rows through an online endpoint wrong?
- An endpoint breaches p99 latency by the time GPU utilisation hits the 80 percent target. Which single autoscaling change fixes it, and which direction do you move the target?
- A custom serving container returns HTTP 404 on every prediction while the health probe passes. What is the most likely cause and the first fix?
What it tests. Serving models in production. Serving for batch and online inference using Vertex AI, Model Garden, Cloud Run, and GKE, packaging models from frameworks such as PyTorch and XGBoost with prebuilt and custom containers, versioning in the Model Registry, implementing A/B and canary rollouts, and handling pre- and postprocessing; and scaling online serving with the Feature Store, public and private endpoints, the right CPU, GPU, TPU, or edge hardware, scaling the backend for throughput, and tuning models for production.
How to study it. Drill the online-versus-batch call: a large periodic bulk scoring job with no per-row latency need is a distributed batch prediction job, not an always-on endpoint; low-latency single requests need an online endpoint. Learn endpoint autoscaling as a set of named levers: a warm non-zero minimum floor to answer the first burst, a high maximum to reach peak, and a lower target utilisation so scale-out fires before replicas saturate and latency breaches. Learn the custom-container contract: the request handler must bind the exact declared predictRoute or Vertex AI returns 404 even with a healthy probe. Practise matching each fault and traffic shape to its fix.
Easy to confuse
- Online endpoint versus batch prediction. An online endpoint is optimised for low-latency single requests and wastes resources on bulk loads; a batch prediction job distributes scoring across workers and reads and writes bulk storage directly, so it scores very large datasets at high throughput and low cost. The deciding constraint is whether per-row latency matters or whole-dataset throughput does.
- Raising versus lowering the autoscaling target utilisation. Raising the target packs each replica fuller before adding capacity, which makes latency worse because scale-out fires after replicas are already saturated; lowering the target adds replicas while each still has headroom to absorb load during the warm-up delay. To protect latency under rising load you lower the target, not raise it.
Worked example from the PMLE bank
Free sampleServing and Scaling Modelshard
A retailer scores its entire 400 million row customer table with a PyTorch model once each night and has no requirement to serve individual real-time requests during the day. Throughput and cost efficiency over the whole dataset matter; sub-second per-row latency does not. Which serving approach best fits this workload?
- ADeploy the model to an always-on Vertex AI online endpoint and stream each of the 400 million rows through individual prediction requests overnight.
- BHost the model on a single Cloud Run instance and send the whole table in one request body so it is processed in a single invocation.
- CSchedule a GKE deployment with a single replica that pulls rows one at a time from a queue and stores each prediction as it finishes.
- DRun a Vertex AI batch prediction job that reads the table from storage, distributes scoring across workers, and writes results back, using a container that matches the model framework. Correct
Match a large periodic bulk scoring workload with no latency requirement to a distributed batch prediction job rather than online serving. Batch prediction distributes scoring across multiple workers and reads and writes bulk storage directly, so it processes very large datasets at high throughput and low cost, whereas online endpoints are optimised for low-latency single requests and waste resources on bulk loads.
Why A is wrong: An online endpoint can technically loop over every row, but paying for an always-on endpoint and issuing hundreds of millions of single requests is far less cost efficient than a distributed batch job for a bulk nightly load.
Why B is wrong: Cloud Run suits request-driven serving, but pushing 400 million rows in one request exceeds practical payload and memory limits on a single instance and offers no parallelism, so it cannot complete the nightly load reliably.
Why C is wrong: A single-replica GKE consumer is a plausible custom pipeline, but processing 400 million rows serially through one replica is slow and underuses the cluster, so it is far less efficient than a distributed batch job.
Why D is correct: A batch prediction job parallelises scoring across workers and reads and writes bulk data directly, which maximises throughput and cost efficiency for a large nightly dataset where per-row latency is irrelevant.
Automating and Orchestrating ML Pipelines
18% of exam
What you must be able to do. Given a need for repeatable training and retraining, choose the orchestration and trigger design that fits the drift pattern and the batch shape, and guarantee training-serving consistency.
In one sentenceThe automation layer: orchestrating end-to-end pipelines, triggering retraining on the right signal, and persisting preprocessing artefacts so serving never diverges from training.
Recall check: answer these from memory first
- Fraud patterns shift unpredictably and you want to avoid wasted compute. Which retraining policy fires only on genuine degradation, and why is a fixed monthly schedule wrong here?
- A nightly batch-prediction pipeline must apply the identical preprocessing the model trained with. What design guarantees parity, and why is sharing the same component that refits per run not enough?
- An upstream system writes each weekly batch as hundreds of small files and a per-object Cloud Build trigger launches hundreds of retrains. How do you get exactly one data-driven retrain per batch?
What it tests. Making the lifecycle repeatable. Developing end-to-end pipelines that validate data and models, orchestrating managed and unmanaged services with Vertex AI Pipelines, Managed Service for Apache Airflow, and Ray, and ensuring consistent preprocessing between training and serving; and automating retraining by choosing an appropriate retraining policy and deploying in CI, CD, and continuous-training pipelines with services such as Cloud Build.
How to study it. Learn the retraining-trigger rule: when degradation is unpredictable, fire on a monitored performance or input-drift threshold, not on a calendar; use a schedule only when drift is genuinely periodic. Lock train-serve parity as the recurring theme: the fix for skew is to persist the fitted transform or precomputed statistics as an artefact and reapply that exact artefact at serving and batch-scoring time, never to recompute statistics from live traffic or to share code that refits per run. Learn the practical Cloud Build trigger trap, gating a multi-file batch on a single end-of-batch sentinel object so it retrains exactly once. Practise matching each pattern to its design.
Easy to confuse
- Threshold-triggered retraining versus a fixed schedule. A monitored performance or drift threshold fires retraining only when the model is actually degrading, which fits unpredictable concept drift and conserves compute; a fixed calendar schedule retrains regardless and either wastes runs or lags real drift. Choose the threshold trigger when the scenario says degradation is unpredictable rather than periodic.
- Persisting the fitted transform as an artefact versus sharing a component that refits per run. Emitting the fitted transform from training and consuming that exact saved artefact at scoring time forces identical imputation, scaling, and vocabularies; calling the same component that refits on whatever data each pipeline sees produces different statistics and reintroduces skew. Parity comes from reusing the fitted artefact, not from sharing code or refitting.
Worked example from the PMLE bank
Free sampleAutomating and Orchestrating ML Pipelineshard
A team factors their feature preprocessing (imputation values, scaling statistics, and category vocabularies) into a single reusable pipeline component. They want the very same component, producing the identical fitted transform, to be used both in the training pipeline and in a nightly batch-prediction pipeline, so the offline scoring run cannot diverge from how the model was trained. Which design most directly guarantees that parity?
- APublish the preprocessing component to a shared registry and call the same component version in both pipelines, letting each pipeline fit the transform freshly on the data it sees at run time.
- BDocument the exact preprocessing steps in a shared README and require both pipeline authors to implement the same imputation, scaling, and encoding logic in their respective components.
- CHave the training component fit the transform and emit it as an output artefact, then have the batch-prediction pipeline consume that exact saved transform artefact rather than recomputing any statistics. Correct
- DRun the batch-prediction pipeline against a copy of the training dataset so that any statistics it recomputes are derived from the same rows the model was trained on.
Train-serve parity for batch inference comes from persisting the fitted transform as an artefact and reapplying it, not from sharing code or refitting per run. Skew arises when scoring recomputes preprocessing statistics from different data; emitting the fitted transform from training and consuming that exact artefact at scoring time forces identical imputation, scaling, and vocabularies across both pipelines.
Why A is wrong: Sharing the component version aligns the code, which is necessary but not sufficient: fitting freshly on each pipeline's own data yields different means and vocabularies, so the values applied at scoring still diverge from training.
Why B is wrong: Written documentation reduces drift in intent, but two separate implementations can still differ in rounding, defaults, or edge handling, so it offers no enforced guarantee that the fitted values match.
Why C is correct: Persisting the fitted transform as an artefact and reusing it at scoring time means both pipelines apply identical statistics and vocabularies, which is the structural guarantee against training-serving skew for batch inference.
Why D is wrong: Pointing scoring at the training rows would match the statistics only for that frozen dataset, yet batch scoring must run on new incoming data, where recomputed statistics would again differ from the training-time fit.
Monitoring AI Solutions
13% of exam
What you must be able to do. Given a production AI system, choose the managed control that secures it against adversarial input and PII leakage, the attribution method that explains it, and the monitoring and alerting setup that catches drift and skew in time.
In one sentenceThe monitoring and responsible-AI layer: securing generative systems with Model Armor, explaining predictions, and configuring Model Monitoring to detect and alert on drift and skew.
Recall check: answer these from memory first
- Prompt-injection inputs make an agent dump its system prompt. Which managed control screens incoming prompts for injection and jailbreak attempts without custom detection code?
- A regulator wants signed per-feature contributions for each declined applicant from a differentiable DNN, at the lowest compute per explanation. Which attribution method, and why not sampled Shapley?
- Model Monitoring computes drift scores correctly but on-call only sees breaches days later on the dashboard. Which single configuration change delivers breaches to the channel automatically?
What it tests. Keeping production AI safe, explainable, and observed. Identifying and mitigating risks with managed controls against data exfiltration, malicious prompting, and sensitive-data leakage using Model Armor, regular expressions, and safety filters, aligning with responsible AI including bias monitoring, and applying model explainability such as integrated gradients and sampled Shapley; and monitoring, testing, and troubleshooting with Model Monitoring for continuous evaluation, detecting training-serving skew, data drift, concept drift, and feature-attribution drift, and configuring notification channels.
How to study it. Learn the security and explainability tools by the job they answer. Model Armor is the managed screen for prompt injection and jailbreak attempts on LLM traffic, so reach for it instead of a brittle hand-written regex when the scenario wants a managed control. For per-instance attribution, integrated gradients suits a differentiable neural network at low compute, while sampled Shapley fits non-differentiable or tree models at higher cost. For monitoring, know that skew detection needs a training baseline dataset to compare against, and that a missed-breach problem is a notification gap fixed by attaching notification channels, not by tuning thresholds or sampling.
Easy to confuse
- Model Armor versus a hand-written regular expression. Model Armor is a managed platform control that inspects requests and responses for prompt injection, jailbreaks, and sensitive-data patterns without code to maintain; a hand-written regex blocks only the exact phrases you anticipated and is trivially bypassed by rewording. When the scenario asks for managed prompt-injection screening, Model Armor is the answer and the regex is the trap.
- Integrated gradients versus sampled Shapley. Integrated gradients accumulate a differentiable model's gradients along a path from a baseline to the input, giving signed per-feature contributions cheaply, so they suit neural networks; sampled Shapley estimates contributions by averaging over many feature-subset permutations and costs more, fitting non-differentiable or tree models. The deciding factors are whether the model is differentiable and the compute budget per explanation.
Worked example from the PMLE bank
Free sampleMonitoring AI Solutionsmedium
A bank serves a deep neural network on the Agent Platform that scores mortgage applications. A regulator asks the team to produce, for each declined applicant, the signed contribution of every input feature to that individual prediction, and the team wants the attribution method that suits a differentiable neural network with the lowest computation per explanation. Which feature-attribution method should they configure for these per-instance explanations?
- ASampled Shapley, which estimates each feature's contribution by averaging over many permutations of feature subsets fed through the model for every explanation.
- BExample-based explanations, which return the nearest labelled training examples to the input rather than a numeric contribution per feature.
- CPermutation feature importance, which shuffles each feature across the evaluation set and measures the resulting drop in overall model accuracy.
- DIntegrated gradients, which integrate the model's gradients along a path from a baseline to the input to attribute the prediction across features for this differentiable network. Correct
Integrated gradients is the per-instance attribution method suited to differentiable neural networks, using model gradients for low-cost feature contributions. Integrated gradients accumulate a differentiable model's gradients along a path from a baseline to the input, yielding signed per-feature contributions for each prediction; this fits a neural network and costs less than permutation-based Shapley sampling, satisfying the per-applicant justification at low compute.
Why A is wrong: Sampled Shapley is tempting because it also yields per-feature attributions, but it is the model-agnostic choice for non-differentiable models such as tree ensembles and needs many forward passes per instance, so it is heavier than necessary for a differentiable network.
Why B is wrong: Returning similar training examples sounds explanatory and is a real technique, but it gives neighbours rather than the signed per-feature contributions the regulator demands, so it does not satisfy the requirement.
Why C is wrong: Permutation importance is plausible because it ranks features, but it produces one global importance score per feature across a dataset, not a contribution for a single declined applicant, so it cannot justify an individual decision.
Why D is correct: Integrated gradients exploit the gradients of a differentiable neural network and give per-feature signed contributions for each instance at modest cost, matching both the model type and the low-compute requirement.

A study plan that works

Map the blueprint and book a date
Day 1
Read the official Google Cloud exam guide and the six domains with their weights. Book a provisional date now: a fixed date turns open-ended study into a plan and is the strongest predictor of actually sitting. Note that Scaling Prototypes (21 percent) and Serving (20 percent) are the two heaviest domains, with Automating Pipelines (18 percent) close behind.
Build the lifecycle decision map
Week 1
Before drilling any domain, build the decision rules the whole exam rests on: the low-code call (BigQuery ML versus AutoML versus a custom Vertex AI job versus a tuned foundation model), the parallelism call (model versus data), and the serving call (online endpoint versus batch prediction). Use the recall prompts in this guide: cover the answer, choose from the constraint, then reveal. If you cannot pick from the scenario alone, you do not own it yet.
Go deep on scaling and serving (Domains 3 and 4)
Weeks 1 to 3
These two are the largest by weight, so they get the most time. Drill model-type-by-constraint, the model-versus-data parallelism rule, the TPU pod scaling reason, the online-versus-batch call, and endpoint autoscaling levers. Practise on scenario questions and read the worked explanation on every one, including the ones you got right, watching for the named constraint that picks the answer.
Lock pipelines and collaboration (Domains 5 and 2)
Weeks 3 to 4
Pipelines rewards the retraining-trigger rule and the train-serve parity pattern of persisting and reapplying the fitted transform. Collaboration rewards the preprocessing scale-to-tool map, the Experiments-versus-ML-Metadata split, and the imbalanced-data metric. Do the harder calls by hand until the constraint alone decides them.
Cover low-code and monitoring (Domains 1 and 6)
Week 4
Low-code rewards the BigQuery ML and Gemini-from-BigQuery decisions plus the generative cost levers such as context caching and the Gen AI evaluation service. Monitoring rewards Model Armor for managed prompt screening, the integrated-gradients-versus-sampled-Shapley call, and the fact that a baseline dataset and notification channels are what make Model Monitoring useful. Both are dependable marks and tie straight back to named constraints.
Drill weak domains, then space the review
Week 5
Use your per-domain accuracy to attack the two domains dragging you down, not to re-read what you already know. Then space it: revisit each domain's recall prompts after a few days and again a week later. Spacing roughly doubles what sticks compared with cramming.
Sit a timed mock and calibrate
Weeks 5 to 6
Take at least one full timed mock under exam conditions to rehearse pacing and the flag-and-return habit across the question set in 120 minutes. Treat the score as a per-domain readiness signal, not a single number, and review every missed question, naming the constraint you misread, before you book or sit.

Know when you're ready

Readiness for the Google Cloud Professional Machine Learning Engineer is a score on scenario questions you have not seen before, not a feeling that the services are familiar. Those are different things, and the gap between them is where people fail. Re-reading the docs builds fluency, and fluency feels like knowledge, so confidence rises while real recall does not. The fix is to test yourself: if you can read a fresh scenario, name the constraint, and pick the right approach while explaining why each other option is wrong, you know it; if you can only nod along to an explanation, you do not yet.

Be especially wary of early confidence on the lifecycle map. Knowing what BigQuery ML, AutoML, Vertex AI training, endpoints, and Model Monitoring each do is the easy half; choosing between them under a cost, latency, skill, or scale constraint, when two of them would work, is the half the exam actually tests. The generative-AI material is newer and easy to under-prepare, so check that context caching, Gemini fine-tuning from BigQuery, the Gen AI evaluation service, and Model Armor are as solid as the classic train-serve-monitor calls. Trust your measured per-domain accuracy over your gut, and set the bar at clearing every domain comfortably on unseen questions across more than one session, not scraping a single pass.

This guide gives you the map. The practice bank is where you find out whether you can navigate it, with a worked explanation and a reason every distractor is wrong on every question. Readiness scoring tells you when you are there. Not before.

Ready to put this into practice?

Free PMLE questions with worked explanations. No sign-up.

Practise PMLE free

Exam-day tips

Read the scenario for its constraint first. The cost, latency, scale, team-skill, or operational-overhead limit named in the question is what picks the answer, so find it before you judge the options.
When two approaches both work, default to the most managed one. Google prefers managed services, so BigQuery ML for a SQL team, AutoML over a hand-built network, a tuned foundation model over training from scratch; reach lower only when the scenario names a reason.
Treat a SQL-fluent team whose data is already in BigQuery as a strong signal. It usually points to BigQuery ML, including fine-tuning a Gemini remote model and serving with ML.GENERATE_TEXT, over an exported custom training job.
Decide online versus batch from the latency requirement. A periodic bulk scoring job with no per-row latency need is a distributed batch prediction job; low-latency single requests need an online endpoint, never the reverse.
On the parallelism question, ask whether the model fits on one device. If it does not fit even at batch size one, it is model parallelism; if it fits but training is slow, it is data parallelism.
Watch for the responsible-AI trap. When a scenario wants managed prompt-injection or jailbreak screening, the answer is Model Armor, not a hand-written regex; and skew detection always needs a training baseline to compare against.
Flag and move on. Cover every question once before you spend time on a hard one; collecting the clear marks first in the 120 minutes protects the ones you actually know.

Frequently asked questions

Is the Google Cloud Professional Machine Learning Engineer hard?

It is an advanced, professional-level exam, and the difficulty is judgement rather than recall. Most questions are scenarios where several Google Cloud services or model types could work and only one fits the stated cost, latency, skill, or scale constraint. Scenario practice with worked explanations matters far more than memorising what each service does.

How long should I study for the PMLE?

Most candidates with real Google Cloud ML experience are ready in six to eight weeks of steady study. Less hands-on exposure means more time on the heavy domains, Scaling Prototypes and Serving, on the newer generative-AI material, and on the lifecycle decisions the whole exam rests on.

What is the pass mark for the PMLE?

Google does not publish an official passing score for its professional exams, and the result is reported as pass or fail. Because there is no public percentage to target, aim to clear every domain comfortably on unseen practice questions rather than chasing a raw figure.

How much generative AI is on the current exam?

A meaningful amount. The blueprint now spreads generative AI across the lifecycle: context caching and Gemini fine-tuning from BigQuery in the low-code domain, the Gen AI evaluation service and LLM-as-a-judge in collaboration, foundation-model tuning in scaling, and Model Armor in monitoring. Treat it as core material, not an afterthought.

Do I need to know how to code for this exam?

You need to read and reason about training scripts, SQL for BigQuery ML, container specs, and the parallelism strategies, but the exam is about choosing and configuring services, not writing programs from scratch. Comfort with Python ML frameworks, SQL, and how Vertex AI trains, serves, and monitors models is what carries you.

How much does the exam cost and how long is it?

The exam is 200 USD and runs for 120 minutes, with multiple-choice and multiple-select questions, as shown in the facts panel above. It is taken online-proctored from your own location or onsite at a test centre.

Which domains should I focus on?

Scaling Prototypes Into ML Models at 21 percent and Serving and Scaling Models at 20 percent are the two heaviest, so they deserve the most time, with Automating and Orchestrating Pipelines at 18 percent close behind. Do not leave Collaboration at 16 percent short either, since its preprocessing and experiment-tracking calls are dependable marks.

How many practice questions should I do before booking?

Enough that every domain clears comfortably on questions you have not seen, and a full timed mock feels comfortable on pacing. Quality of review beats raw volume: on every question, read the explanation and name the constraint that picked the answer, including on the ones you got right.

Is the Google Cloud Professional Machine Learning Engineer certification worth it?

The PMLE is worth it for ML engineers and data scientists whose work runs on Google Cloud, particularly those who design, build, or maintain Vertex AI pipelines, managed training jobs, or production model serving. It is a professional-level credential that covers the full lifecycle from low-code and BigQuery ML through to distributed training, serving optimisation, and generative AI evaluation, which makes it a credible signal of breadth. Those already working in a GCP ML environment will find the certification reinforces and formalises applied knowledge that is easy to accumulate informally.

Practise PMLE free PMLE one-page cheat sheet PMLE practice questions and domains

Examworthy is not affiliated with or endorsed by Google Cloud. This guide is original study material based on the public exam blueprint. We never reproduce live exam items. PMLE and related marks belong to their respective owners.

How to pass Google Cloud Professional Machine Learning Engineer (PMLE)

How this exam thinks

What each domain tests and how to study it

Architecting Low-Code AI Solutions

Collaborating Within and Across Teams to Manage Data and Models

Scaling Prototypes Into ML Models

Serving and Scaling Models

Automating and Orchestrating ML Pipelines

Monitoring AI Solutions

A study plan that works

Map the blueprint and book a date

Build the lifecycle decision map

Go deep on scaling and serving (Domains 3 and 4)

Lock pipelines and collaboration (Domains 5 and 2)

Cover low-code and monitoring (Domains 1 and 6)

Drill weak domains, then space the review

Sit a timed mock and calibrate

Know when you're ready

Exam-day tips

Frequently asked questions

Related certifications