Objectives in this domain

Explore and preprocess data for ML across tabular, text, and image types, choosing the right tool for scale such as BigQuery, Dataflow, Apache Spark, and in-memory Python frameworks, consolidating features in the Agent Platform Feature Store, and protecting personally identifiable information.
Section 2.1medium
Prototype models in notebooks using Agent Platform Workbench and Colab Enterprise, applying collaboration and security best practices with common frameworks such as PyTorch, scikit-learn, and JAX, and using Model Garden models.
Section 2.2medium
Track and run ML experiments, choosing the right environment among Experiments on the Agent Platform, Agent Platform Pipelines, and Kubeflow Pipelines, evaluating predictive and gen AI solutions including LLM-as-a-judge, and tracking artifacts, versions, and lineage with ML Metadata.
Section 2.3medium

Sample question from this domain

Free sampleCollaborating Within and Across Teams to Manage Data and Modelsmedium

A retail data team must join two years of clickstream logs stored as 4 TB of Parquet in Cloud Storage with a customer dimension table in BigQuery, producing an aggregated feature table for model training. The transformation involves several wide joins and window functions over the full history, runs as a one-off backfill, and the team wants to write standard SQL rather than maintain cluster infrastructure. Which approach should the team choose for this preprocessing job?

AProvision a Dataproc cluster running Apache Spark, read both sources with the connectors, and express the joins and window functions in Spark SQL.
BBuild a streaming Dataflow pipeline with the Apache Beam SDK that reads the Parquet files and emits aggregated features continuously.
CDownload the Parquet files to a single large virtual machine and process the joins with the pandas in-memory framework.
DLoad the Parquet files into BigQuery as an external table and run the joins and window functions as a scheduled BigQuery SQL query. Correct

Choose BigQuery for large bounded SQL-based joins and window aggregations when avoiding cluster management matters. BigQuery is a serverless analytics engine that scales terabyte joins and window functions automatically, reads Parquet via external tables, and needs no provisioned cluster, so it satisfies the scale, SQL, and no-infrastructure constraints together.

Why A is wrong: Spark on Dataproc can express the same logic and is tempting for large Parquet workloads, but it forces the team to size, run, and tear down a cluster, which contradicts their wish to avoid maintaining infrastructure for a one-off backfill.

Why B is wrong: Dataflow suits continuous or event-time pipelines, so it sounds modern here, but a streaming job is the wrong shape for a bounded historical backfill and requires Beam coding rather than the standard SQL the team asked for.

Why C is wrong: Pandas is the right reach for small datasets and feels simple, but 4 TB will not fit in a single machine's memory, so this in-memory approach fails at the required scale.

Why D is correct: BigQuery handles multi-terabyte joins and window functions in standard SQL with no cluster to manage, and external tables let it read the Parquet directly alongside the existing dimension table, matching every stated constraint.

Other domains in this exam

Architecting Low-Code AI Solutions12% of the exam
Scaling Prototypes Into ML Models21% of the exam
Serving and Scaling Models20% of the exam
Automating and Orchestrating ML Pipelines18% of the exam
Monitoring AI Solutions13% of the exam

See also the PMLE cert hub, the study guide, and the cheat sheet.