PMLE domain - 16% of the exam

Collaborating Within and Across Teams to Manage Data and Models

Collaborating Within and Across Teams to Manage Data and Models is 16% of the Google Cloud Professional Machine Learning Engineer (PMLE) exam. These are the objectives it covers, each with practice questions and worked explanations.

Objectives in this domain

Sample question from this domain

Free sampleCollaborating Within and Across Teams to Manage Data and Modelsmedium

A retail data team must join two years of clickstream logs stored as 4 TB of Parquet in Cloud Storage with a customer dimension table in BigQuery, producing an aggregated feature table for model training. The transformation involves several wide joins and window functions over the full history, runs as a one-off backfill, and the team wants to write standard SQL rather than maintain cluster infrastructure. Which approach should the team choose for this preprocessing job?

  • AProvision a Dataproc cluster running Apache Spark, read both sources with the connectors, and express the joins and window functions in Spark SQL.
  • BBuild a streaming Dataflow pipeline with the Apache Beam SDK that reads the Parquet files and emits aggregated features continuously.
  • CDownload the Parquet files to a single large virtual machine and process the joins with the pandas in-memory framework.
  • DLoad the Parquet files into BigQuery as an external table and run the joins and window functions as a scheduled BigQuery SQL query. Correct
Choose BigQuery for large bounded SQL-based joins and window aggregations when avoiding cluster management matters. BigQuery is a serverless analytics engine that scales terabyte joins and window functions automatically, reads Parquet via external tables, and needs no provisioned cluster, so it satisfies the scale, SQL, and no-infrastructure constraints together.

Why A is wrong: Spark on Dataproc can express the same logic and is tempting for large Parquet workloads, but it forces the team to size, run, and tear down a cluster, which contradicts their wish to avoid maintaining infrastructure for a one-off backfill.

Why B is wrong: Dataflow suits continuous or event-time pipelines, so it sounds modern here, but a streaming job is the wrong shape for a bounded historical backfill and requires Beam coding rather than the standard SQL the team asked for.

Why C is wrong: Pandas is the right reach for small datasets and feels simple, but 4 TB will not fit in a single machine's memory, so this in-memory approach fails at the required scale.

Why D is correct: BigQuery handles multi-terabyte joins and window functions in standard SQL with no cluster to manage, and external tables let it read the Parquet directly alongside the existing dimension table, matching every stated constraint.

Other domains in this exam

See also the PMLE cert hub, the study guide, and the cheat sheet.

Examworthy is not affiliated with or endorsed by Google Cloud. Original, blueprint-aligned practice material only.