A retail data team must join two years of clickstream logs stored as 4 TB of Parquet in Cloud Storage with a customer dimension table in BigQuery, producing an aggregated feature table for model training. The transformation involves several wide joins and window functions over the full history, runs as a one-off backfill, and the team wants to write standard SQL rather than maintain cluster infrastructure. Which approach should the team choose for this preprocessing job?
- AProvision a Dataproc cluster running Apache Spark, read both sources with the connectors, and express the joins and window functions in Spark SQL.
- BBuild a streaming Dataflow pipeline with the Apache Beam SDK that reads the Parquet files and emits aggregated features continuously.
- CDownload the Parquet files to a single large virtual machine and process the joins with the pandas in-memory framework.
- DLoad the Parquet files into BigQuery as an external table and run the joins and window functions as a scheduled BigQuery SQL query. Correct
Why A is wrong: Spark on Dataproc can express the same logic and is tempting for large Parquet workloads, but it forces the team to size, run, and tear down a cluster, which contradicts their wish to avoid maintaining infrastructure for a one-off backfill.
Why B is wrong: Dataflow suits continuous or event-time pipelines, so it sounds modern here, but a streaming job is the wrong shape for a bounded historical backfill and requires Beam coding rather than the standard SQL the team asked for.
Why C is wrong: Pandas is the right reach for small datasets and feels simple, but 4 TB will not fit in a single machine's memory, so this in-memory approach fails at the required scale.
Why D is correct: BigQuery handles multi-terabyte joins and window functions in standard SQL with no cluster to manage, and external tables let it read the Parquet directly alongside the existing dimension table, matching every stated constraint.