A retail team runs a Gemini-based product description generator on Vertex AI. Traffic is steady at roughly 40 requests per second during business hours, and the same handful of category prompts repeat constantly because most products share templated instructions. Average latency has crept up and the monthly bill is dominated by input tokens. Which change to the Gemini request configuration will most directly cut both the per-call cost and the latency of these repeated calls?
- A. Enable context caching for the repeated prompt prefix so the shared instruction tokens are stored and billed at a reduced rate on each call. (Correct)
- B. Raise the temperature parameter so the model commits to an answer sooner and returns fewer retried generations per request.
- C. Switch the endpoint from streaming to non-streaming responses so the full output arrives in one network round trip per request.
- D. Increase the maxOutputTokens limit so each call finishes generation in a single pass instead of being truncated and retried.
Why A is correct: Context caching stores the large repeated prefix once and reuses it, so the shared instruction tokens are billed at the lower cached rate and are not re-processed, which lowers both cost and latency for the repeating prompts.
Why B is wrong: Temperature only changes how random the sampling is; it does not reduce the tokens billed or shorten the prompt, so it leaves both the input-token cost and the latency of these repeated calls untouched.
Why C is wrong: Switching to non-streaming changes how the response is delivered, not how many tokens are processed. It often raises perceived latency because nothing returns until generation finishes, and it does nothing about the input-token cost driver.
Why D is wrong: A higher output limit permits longer, more expensive completions rather than cheaper ones. The bottleneck here is repeated input tokens, so raising the output ceiling can only add cost and latency, not cut them.
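The cost argument behind option A can be checked with simple arithmetic. The sketch below is illustrative only: the `Pricing` class, the `input_cost_per_call` helper, the token counts, and the rates are all hypothetical placeholders, not actual Vertex AI prices. It also ignores any cache storage charge and any minimum cacheable prompt size, both of which should be confirmed against current documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-1,000-token rates in arbitrary currency units."""
    input_per_1k: float   # rate for ordinary (uncached) input tokens
    cached_per_1k: float  # assumed discounted rate for cached input tokens


def input_cost_per_call(prefix_tokens: int, unique_tokens: int,
                        pricing: Pricing, cached: bool) -> float:
    """Input-token cost of one call.

    The shared prefix is billed at the cached rate only when `cached` is True;
    the per-product (unique) tokens are always billed at the regular rate.
    """
    prefix_rate = pricing.cached_per_1k if cached else pricing.input_per_1k
    total = prefix_tokens * prefix_rate + unique_tokens * pricing.input_per_1k
    return total / 1000


# Placeholder numbers: a 4,000-token templated prefix plus 200 product-specific tokens.
EXAMPLE_PRICING = Pricing(input_per_1k=1.0, cached_per_1k=0.25)
uncached_cost = input_cost_per_call(4000, 200, EXAMPLE_PRICING, cached=False)
cached_cost = input_cost_per_call(4000, 200, EXAMPLE_PRICING, cached=True)

if __name__ == "__main__":
    saving = 1 - cached_cost / uncached_cost
    print(f"uncached: {uncached_cost:.2f}  cached: {cached_cost:.2f}  saving: {saving:.0%}")
```

Under these assumed numbers the saving comes entirely from the repeated prefix, which is why the benefit grows with the share of each prompt that is templated. Options B, C, and D change none of the terms in this calculation.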