Objectives in this domain

Develop ML models using BigQuery ML and AutoML on the Gemini Enterprise Agent Platform, including classification, regression, forecasting, and clustering, plus feature engineering, prediction, and fine-tuning Gemini models from BigQuery.
Section 1.1 (medium)
Build AI solutions with Google Cloud AI APIs and foundation models, selecting models from Model Garden, using industry-specific APIs such as Document AI, Vision, and Translate, tuning Gemini, Imagen, and Veo, and optimising for cost, latency, and availability.
Section 1.2 (medium)

Sample question from this domain

Free sample: Architecting Low-Code AI Solutions (medium)

A retail team runs a Gemini-based product description generator on Vertex AI. Traffic is steady at roughly 40 requests per second during business hours, and the same handful of category prompts repeat constantly because most products share templated instructions. Average latency has crept up and the monthly bill is dominated by input tokens. Which change to the Gemini request configuration will most directly cut both the per-call cost and the latency of these repeated calls?

A. Enable context caching for the repeated prompt prefix so the shared instruction tokens are stored and billed at a reduced rate on each call. (Correct)
B. Raise the temperature parameter so the model commits to an answer sooner and returns fewer retried generations per request.
C. Switch the endpoint from streaming to non-streaming responses so the full output arrives in one network round trip per request.
D. Increase the maxOutputTokens limit so each call finishes generation in a single pass instead of being truncated and retried.

Use Gemini context caching to cut cost and latency when a large prompt prefix is reused across many requests. Repeated calls that share a long instruction prefix re-process the same tokens every time; context caching persists that prefix server-side, billing it at a reduced cached rate and skipping its recomputation, so both token cost and time-to-first-token drop for the repeating workload.

Why A is correct: Context caching stores the large repeated prefix once and reuses it, so the shared instruction tokens are billed at the lower cached rate and are not re-processed, which lowers both cost and latency for the repeating prompts.

Why B is wrong: Temperature only changes how random the sampling is; it does not reduce the tokens billed or shorten the prompt, so it leaves both the input-token cost and the latency of these repeated calls untouched.

Why C is wrong: Non-streaming changes only how the response is delivered, not how many tokens are processed; it often raises perceived latency because nothing returns until generation finishes, and it leaves the input-token cost driver untouched.

Why D is wrong: A higher output limit permits longer, more expensive completions rather than cheaper ones; the bottleneck here is repeated input tokens, so raising the output ceiling adds cost and latency instead of cutting them.
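The cost argument for option A can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the token counts, call volume, price per million input tokens, and cached-token discount are all hypothetical placeholders, not published Gemini pricing, and the model ignores any cache storage charge or minimum cacheable prompt size that the real service may apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a repeating prompt workload."""
    prefix_tokens: int  # shared instruction tokens sent on every call
    unique_tokens: int  # per-product tokens that differ on every call
    calls: int          # number of requests in the billing period


def input_cost(w: Workload, price_per_million: float,
               cached_discount: float = 0.0) -> float:
    """Return the total input-token cost for the workload.

    cached_discount is the assumed fraction taken off the price of the
    shared prefix when it is served from a cache (0.0 means no caching).
    """
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    prefix_rate = price_per_million * (1.0 - cached_discount)
    prefix_cost = w.prefix_tokens * w.calls * prefix_rate
    unique_cost = w.unique_tokens * w.calls * price_per_million
    return (prefix_cost + unique_cost) / 1_000_000


# Assumed numbers: a 4,000-token templated prefix, 200 product-specific
# tokens, one million calls, 1.00 currency unit per million input tokens,
# and a 75% discount on cached tokens.
workload = Workload(prefix_tokens=4_000, unique_tokens=200, calls=1_000_000)
uncached = input_cost(workload, price_per_million=1.00)
cached = input_cost(workload, price_per_million=1.00, cached_discount=0.75)
saving_fraction = (uncached - cached) / uncached

print(f"uncached input cost: {uncached:.2f}")
print(f"cached input cost:   {cached:.2f}")
print(f"saving:              {saving_fraction:.0%}")
```

The saving grows with the share of each request taken up by the repeated prefix, which is why caching suits this workload and why changing temperature, streaming mode, or maxOutputTokens does nothing for the input-token bill.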

Other domains in this exam

Collaborating Within and Across Teams to Manage Data and Models: 16% of the exam
Scaling Prototypes Into ML Models: 21% of the exam
Serving and Scaling Models: 20% of the exam
Automating and Orchestrating ML Pipelines: 18% of the exam
Monitoring AI Solutions: 13% of the exam

See also the PMLE cert hub, the study guide, and the cheat sheet.