Google Cloud Professional Machine Learning Engineer (PMLE) practice questions

Build, deploy, scale, and monitor machine learning and generative AI solutions on Google Cloud, with a worked explanation on every practice question.

New to PMLE? Read the "How to pass Google Cloud Professional Machine Learning Engineer" study guide for a domain breakdown, a study plan, and exam-day tips.

Revising? The PMLE cheat sheet puts the domain weightings, key facts, and easy-to-confuse traps on one printable page.

Questions: 50 to 60
Time allowed: 120 min
Exam cost (USD): $200
Practice questions: 230

Exam domains and weighting

The PMLE blueprint is split across 6 domains. See the official exam guide for the authoritative breakdown.

PMLE domains by share of the exam

Domain                                                                Weight
Architecting Low-Code AI Solutions                                    12%
Collaborating Within and Across Teams to Manage Data and Models       16%
Scaling Prototypes Into ML Models                                     21%
Serving and Scaling Models                                            20%
Automating and Orchestrating ML Pipelines                             18%
Monitoring AI Solutions                                               13%
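
As a rough planning aid, the weights above can be turned into an approximate question count per domain. This is a sketch only: it assumes questions are spread in proportion to the listed weights, which the exam does not guarantee, and the function name `question_range` is made up for illustration.

```python
# Domain weights as listed in the table above (percent of the exam).
WEIGHTS = {
    "Architecting Low-Code AI Solutions": 12,
    "Collaborating Within and Across Teams to Manage Data and Models": 16,
    "Scaling Prototypes Into ML Models": 21,
    "Serving and Scaling Models": 20,
    "Automating and Orchestrating ML Pipelines": 18,
    "Monitoring AI Solutions": 13,
}


def question_range(weight_pct, low=50, high=60):
    """Approximate question count for one domain on a 50- and a 60-question exam."""
    return round(low * weight_pct / 100), round(high * weight_pct / 100)


for domain, pct in WEIGHTS.items():
    fewest, most = question_range(pct)
    print(f"{domain}: about {fewest} to {most} questions")

# Average time budget per question across the 120-minute sitting.
print(f"Minutes per question: {120 / 60:.1f} to {120 / 50:.1f}")
```

Treat the output as a guide for dividing study time, not as a prediction of the paper you will sit.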

Free sample questions

No account needed. Every question has a worked explanation, just like the full bank.

Free sample - Architecting Low-Code AI Solutions - medium

A retail team runs a Gemini-based product description generator on Vertex AI. Traffic is steady at roughly 40 requests per second during business hours, and the same handful of category prompts repeat constantly because most products share templated instructions. Average latency has crept up and the monthly bill is dominated by input tokens. Which change to the Gemini request configuration will most directly cut both the per-call cost and the latency of these repeated calls?

  • A. Enable context caching for the repeated prompt prefix so the shared instruction tokens are stored and billed at a reduced rate on each call. (Correct)
  • B. Raise the temperature parameter so the model commits to an answer sooner and returns fewer retried generations per request.
  • C. Switch the endpoint from streaming to non-streaming responses so the full output arrives in one network round trip per request.
  • D. Increase the maxOutputTokens limit so each call finishes generation in a single pass instead of being truncated and retried.

Use Gemini context caching to cut cost and latency when a large prompt prefix is reused across many requests.

Repeated calls that share a long instruction prefix re-process the same tokens every time; context caching persists that prefix server-side, billing it at a reduced cached rate and skipping its recomputation, so both token cost and time-to-first-token drop for the repeating workload.

Why A is correct: Context caching stores the large repeated prefix once and reuses it, so the shared instruction tokens are billed at the lower cached rate and are not re-processed, which lowers both cost and latency for the repeating prompts.

Why B is wrong: Temperature only changes how random the sampling is; it does not reduce the tokens billed or shorten the prompt, so it leaves both the input-token cost and the latency of these repeated calls untouched.

Why C is wrong: Switching to non-streaming changes how the response is delivered, not how many tokens are processed; it often raises perceived latency because nothing is returned until generation finishes, and it leaves the input-token cost driver untouched.

Why D is wrong: A higher output limit permits longer, more expensive completions rather than cheaper ones; the bottleneck here is repeated input tokens, so raising the output ceiling adds cost and latency instead of cutting them.
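
The cost argument for option A can be checked with simple arithmetic. The sketch below is illustrative only: the token counts, the price per token, and the cached-token discount are hypothetical placeholders rather than Google Cloud pricing, and it ignores any separate charge for keeping the cache alive.

```python
def input_cost_per_call(prefix_tokens, unique_tokens, price_per_token, cached_discount=0.0):
    """Input-token cost of one call; cached_discount is the fraction taken off the shared prefix."""
    prefix_cost = prefix_tokens * price_per_token * (1.0 - cached_discount)
    return prefix_cost + unique_tokens * price_per_token


# Hypothetical figures chosen only to show the shape of the saving.
PREFIX_TOKENS = 4000   # shared templated instructions (assumed size)
UNIQUE_TOKENS = 200    # per-product details (assumed size)
PRICE = 1e-6           # assumed USD per input token
DISCOUNT = 0.75        # assumed reduction for cached tokens

uncached = input_cost_per_call(PREFIX_TOKENS, UNIQUE_TOKENS, PRICE)
cached = input_cost_per_call(PREFIX_TOKENS, UNIQUE_TOKENS, PRICE, DISCOUNT)
saving = 1 - cached / uncached

print(f"Input cost per call falls by {saving:.0%} under these assumptions")
```

The larger the shared prefix is relative to the unique part of each prompt, the closer the saving gets to the cached discount itself, which is why the scenario's templated prompts are a good fit.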

Free sample - Architecting Low-Code AI Solutions - medium

An insurer wants to extract named fields such as policy number, claimant name, and total amount from millions of scanned claim PDFs that follow a small set of fixed layouts. The team needs structured key-value output, high accuracy on these specific forms, and the lowest ongoing per-document cost rather than free-form reasoning. Which Google Cloud approach best fits this requirement?

  • A. Prompt a general Gemini model with each page image and ask it to return the fields as JSON, relying on its multimodal reasoning over the scans.
  • B. Train a Document AI custom extractor on the known layouts so it returns the policy fields as structured key-value pairs per document. (Correct)
  • C. Use the Vision API text detection feature to read all characters, then write rules to locate each field by its position on the page.
  • D. Translate every document with the Translate API first to normalise the text, then apply a keyword search to pull out the required values.

Choose Document AI custom extractors for high-accuracy structured field extraction from fixed-layout documents at scale.

Fixed-layout, high-volume field extraction is exactly what Document AI custom extractors are trained for: they learn the form structure and output typed key-value entities, giving better accuracy and lower per-document cost than general multimodal prompting or raw OCR plus hand-written positional rules.

Why A is wrong: Gemini can read documents, but per-document inference on millions of pages is costly and its general reasoning is less consistent for fixed-layout field extraction than a model trained on those forms, so it does not meet the accuracy-and-cost goal best.

Why B is correct: Document AI custom extractors are purpose-built to learn fixed layouts and emit structured key-value entities with high accuracy at a predictable per-page price, which directly satisfies the structured-output, accuracy, and low-cost requirements.

Why C is wrong: Vision text detection returns raw characters without document structure, so brittle position rules are needed to recover key-value pairs; this tempts teams who think OCR alone is enough but it lacks the form understanding Document AI provides.

Why D is wrong: Translate changes language and does nothing to detect layout or extract structured fields, so it is the wrong service entirely for a same-language field-extraction task even though it sounds like a preprocessing step.
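
To make "structured key-value output" concrete, the sketch below shows one way downstream code might turn a list of extracted entities into a claim record. The `Entity` class is a hypothetical stand-in with an assumed shape (field type, text, confidence); it is not the Document AI client type, and the confidence threshold and sample values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for one extracted field; not the Document AI client type."""
    type: str
    text: str
    confidence: float


def to_record(entities, min_confidence=0.8):
    """Keep the highest-confidence value per field; list low-confidence fields for review."""
    best = {}
    for entity in entities:
        if entity.type not in best or entity.confidence > best[entity.type].confidence:
            best[entity.type] = entity
    record = {name: e.text for name, e in best.items() if e.confidence >= min_confidence}
    review = sorted(name for name, e in best.items() if e.confidence < min_confidence)
    return record, review


# Made-up sample values for the three fields named in the question.
sample = [
    Entity("policy_number", "PN-001", 0.98),
    Entity("claimant_name", "A. Example", 0.95),
    Entity("total_amount", "120.00", 0.55),
]

record, review = to_record(sample)
print(record)   # fields accepted automatically
print(review)   # fields to route to a human reviewer
```

Because each value arrives already labelled with its field name, no positional rules are needed, which is the contrast with option C.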

Free sample - Architecting Low-Code AI Solutions - medium

A support team needs Gemini to answer customer questions strictly from their internal policy documents, which run to tens of thousands of pages and are revised weekly. The team has little machine-learning expertise, wants answers to reflect the latest revisions without retraining, and must keep responses grounded in the source text. Which approach should the team adopt to meet these constraints?

  • A. Fine-tune a Gemini model on the full policy corpus each week so the latest wording is absorbed into the model weights before queries arrive.
  • B. Paste the entire policy corpus into a single very large prompt for every question so the model always sees the complete current text.
  • C. Build a retrieval-augmented generation pipeline that fetches the relevant policy passages and passes them to Gemini as grounding context at query time. (Correct)
  • D. Select a larger model from Model Garden and raise its temperature so it draws on broader knowledge when policies change.

Adopt retrieval-augmented generation to keep Gemini answers grounded and current without retraining when source documents change often.

When source content is large and frequently revised, retrieval-augmented generation supplies the relevant up-to-date passages as context at inference time, so the model grounds its answer in the latest text without any retraining, which fine-tuning, prompt-stuffing, and model swaps cannot achieve.

Why A is wrong: Weekly fine-tuning on a huge corpus is expensive, demands ML skill the team lacks, and bakes facts into weights that still go stale between runs, so it fails the low-effort, always-current, and grounded requirements.

Why B is wrong: Tens of thousands of pages exceed practical context limits and would make each call slow and costly; stuffing the whole corpus per query is tempting because it is simple but it does not scale and wastes tokens on irrelevant material.

Why C is correct: Retrieval-augmented generation injects the current source passages into the prompt at query time, so updated documents are reflected immediately without retraining and answers stay grounded in the retrieved text, matching every stated constraint.

Why D is wrong: A bigger model and higher temperature add capacity and randomness but neither grounds answers in the internal documents nor reflects weekly revisions, so the responses can drift from the authoritative source text.
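
The retrieve-then-ground flow behind option C can be sketched in a few lines. This is a toy illustration, not a production pipeline: word overlap stands in for a real vector search, the policy passages are invented, and the code stops at building the prompt rather than calling any model.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, passages, k=2):
    """Rank passages by word overlap with the question (toy stand-in for vector search)."""
    question_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(question_words & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Assemble a prompt that tells the model to answer only from the retrieved text."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the policy passages below. "
        "If they do not contain the answer, say so.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}"
    )


# Invented passages; updating this list is all it takes to reflect a policy revision.
POLICIES = [
    "Refunds are issued within 14 days of a returned item being received.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards cannot be exchanged for cash.",
]

question = "How many days do refunds take?"
prompt = build_prompt(question, retrieve(question, POLICIES, k=1))
print(prompt)
```

Only the passages that match the question reach the model, so the prompt stays small and a weekly revision is picked up as soon as the passage store is refreshed, with no retraining step.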

Frequently asked questions

How many questions are on the PMLE exam?
The Google Cloud Professional Machine Learning Engineer (PMLE) exam has 50 to 60 questions and runs for 120 minutes. The format is multiple choice and multiple select, online- or onsite-proctored.

What score do I need to pass PMLE?
Google Cloud does not publish a fixed pass mark for PMLE, so treat any "X%" figure you see elsewhere as unofficial. Examworthy gives you a per-domain readiness score so you can judge when you are ready across every domain.

How much does the PMLE exam cost?
The exam costs 200 USD to sit. Practising on Examworthy is free to start, with a worked explanation on every question.

How does Examworthy help me prepare for PMLE?
Every practice question carries a worked explanation and a per-distractor rationale, mapped to the official blueprint domains. You learn why each answer is right or wrong, not just the letter.

Is Examworthy affiliated with Google Cloud?
No. Examworthy is not affiliated with or endorsed by Google Cloud. Our questions are original, blueprint-aligned practice material; we never reproduce live exam items.

Examworthy is not affiliated with or endorsed by Google Cloud. All questions are original, blueprint-aligned practice material. We never reproduce live exam items. PMLE and related marks belong to their respective owners.