How to pass Google Cloud Professional Data Engineer (PDE)
20 min read. 5 domains covered. Free practice, no sign-up.
The Google Cloud Professional Data Engineer (PDE) tests one skill above all others: picking the right service for a scenario. Google gives you a business situation with constraints on cost, latency, scale, and how much operational work the team can absorb, then asks which Google Cloud product or pattern fits. The hard part is rarely knowing what a service does; it is knowing which one wins when four of them could plausibly do the job and only one matches every constraint in the scenario.
It suits practitioners who already build and run data systems: data engineers, ETL and analytics developers, and architects moving onto Google Cloud who need to prove they can choose between the managed options correctly. The exam is 40 to 50 multiple-choice and multiple-select questions in 120 minutes, drawn across five domains that span designing, ingesting, storing, analysing, and maintaining data workloads. There is no formal prerequisite, but the questions assume real exposure to pipelines, warehouses, and the trade-offs between them.
The exam rewards decision rules, not feature recall. Most questions are short scenarios where two or three answers are technically capable and only one is the best fit once you weigh the constraint that was named: the lowest cost, the lowest latency, the least code to maintain, or the strictest consistency. The skill being tested is choosing correctly under that pressure, which is why practising on scenario questions with a worked explanation, and a reason every wrong option is wrong, beats memorising service datasheets.
PDE is a pick-the-right-service exam: almost every question is a scenario with cost, latency, scale, and operational constraints, and the right answer is the Google Cloud service that fits them, usually the most managed option that meets the requirement.
Difficulty
Advanced
Best for
Working data practitioners: data engineers, ETL and analytics developers, and cloud architects who design, build, and run data processing systems and need to prove they can choose the right Google Cloud service under real constraints.
Prerequisites
None enforced. Google recommends around three years of industry experience including a year or more building and managing solutions on Google Cloud. Hands-on exposure to BigQuery, Dataflow, Pub/Sub, and the storage options is what actually carries you.
Questions
40 to 50
Time allowed
120 min
Exam cost (USD)
$200
Practice questions
285
How this exam thinks
One habit decides this exam: read the scenario for its constraint, then pick the service that fits it. Almost every question is a short business situation with a stated limit on cost, latency, scale, or operational overhead, and the answer is the Google Cloud product or pattern that meets that limit. Several options will be technically capable. Only one is the best fit once you weigh what the scenario actually asked for.
The default tie-breaker is the most managed option that meets the requirement. Google designs the exam around its own preference for serverless and fully managed services, so when two answers both work, the one with less infrastructure to run usually wins: BigQuery over a self-managed warehouse, Dataflow over hand-rolled Spark on Dataproc, Pub/Sub over a Kafka cluster you operate. Reach for the less managed option only when the scenario names a reason, such as an existing Hadoop or Spark estate, an open-source dependency, or a need to lift and shift without rewriting code. That reason is the signal that the obvious managed answer is the trap.
The rest is a handful of discriminations the exam leans on, each driven by the constraint in the scenario. Pub/Sub ingests and buffers; Dataflow transforms streams and batches; Dataproc runs existing Spark and Hadoop; Composer orchestrates and schedules the whole pipeline. For storage, the access pattern decides: BigQuery for analytical SQL over huge tables, Bigtable for high-throughput low-latency key lookups, Cloud SQL for a regional relational database, Spanner when you need to scale writes horizontally with strong consistency, including across regions. Batch versus streaming, partitioning versus clustering, persistent versus job-scoped clusters all resolve the same way: name the constraint, then choose the service or setting built for it.
What each domain tests and how to study it
The PDE blueprint is split across five domains, each weighted by its official share of the exam; see the official exam guide for the authoritative breakdown.
Designing Data Processing Systems
What you must be able to do. Given a set of business requirements, choose the architecture, security model, and migration path that meets the stated constraints on compliance, reliability, portability, and downtime.
In one sentence. The design layer: translating business requirements into a Google Cloud architecture that is secure, reliable, portable, and migratable.
Recall check: answer these from memory first
Name the four migration tools and the one scenario each is built for, including which one you reach for when the network transfer would take weeks.
Which control finds and masks PII, and which control enforces a guardrail across every project in the organisation?
What does customer-managed encryption give you that Google's default encryption does not, and when does a scenario require it?
What it tests. Turning requirements into a sound design before any pipeline is built. Designing for security and compliance (Cloud IAM roles and least privilege, organisation policies, customer-managed encryption keys, Cloud DLP for PII, regional data sovereignty); designing for reliability and fidelity (data preparation with Dataform and Dataflow, pipeline orchestration, disaster recovery, ACID compliance, and validation); designing for flexibility and portability (mapping requirements to architecture, multi-cloud portability with BigLake, cataloguing with Dataplex Catalog); and planning data migrations with the right tool, from BigQuery Data Transfer Service to Database Migration Service, Datastream, and Transfer Appliance.
How to study it. Learn each migration tool by the job it is built for, because the exam tests the choice, not the mechanics. BigQuery Data Transfer Service loads from SaaS and other warehouses on a schedule; Database Migration Service moves operational databases into Cloud SQL or AlloyDB with low downtime; Datastream is change data capture for near-real-time replication; Transfer Appliance is the physical box for petabytes when the network would take too long. For security, learn which control answers which requirement: customer-managed encryption keys for key ownership, Cloud DLP for finding and masking PII, organisation policies for guardrails you cannot override per project. Practise reading a requirement and naming the single tool or control that fits.
Easy to confuse
Database Migration Service versus Datastream. Database Migration Service moves a database once into Cloud SQL or AlloyDB with minimal downtime; Datastream is ongoing change data capture that streams inserts and updates into BigQuery or Cloud Storage. One is a migration, the other is continuous replication.
BigQuery Data Transfer Service versus Transfer Appliance. Data Transfer Service is scheduled, network-based loading from SaaS apps and warehouses; Transfer Appliance is a shippable physical device for petabyte-scale data where the network would take too long. The deciding constraint is data volume against available bandwidth.
Organisation policy versus IAM role. An IAM role grants a principal permission to act; an organisation policy sets a guardrail that constrains what anyone can do, such as restricting resource locations. IAM says who can do something; org policy says what is allowed at all.
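The volume-against-bandwidth constraint behind the Transfer Appliance choice can be made concrete with simple arithmetic. The sketch below is illustrative only: the function name, the decimal unit conversion, and the 80% sustained-utilisation default are assumptions, not figures from Google.

```python
def transfer_days(data_tb: float, bandwidth_mbps: float, utilisation: float = 0.8) -> float:
    """Estimate the days needed to move data_tb terabytes over a network link.

    Assumes decimal units (1 TB = 8,000,000 megabits) and a sustained
    utilisation fraction. Both are simplifying assumptions.
    """
    if data_tb < 0 or bandwidth_mbps <= 0 or not 0 < utilisation <= 1:
        raise ValueError("data must be >= 0, bandwidth > 0, utilisation in (0, 1]")
    megabits = data_tb * 8_000_000                      # terabytes -> megabits
    seconds = megabits / (bandwidth_mbps * utilisation)  # effective throughput
    return seconds / 86_400                              # seconds -> days


# 1 PB (1,000 TB) over a 1 Gbps link at 80% utilisation
print(f"{transfer_days(1_000, 1_000):.0f} days")   # roughly 116 days
# 10 TB over the same link
print(f"{transfer_days(10, 1_000):.1f} days")      # roughly 1.2 days
```

When the estimate runs to weeks or months, the scenario is pointing at a physical transfer; when it is a day or two, a network-based tool is the better fit.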
Worked example from the PDE bank
Free sample. Domain: Designing Data Processing Systems. Difficulty: medium.
A data platform team grants an analyst the BigQuery Data Viewer role at the project level so the analyst can query several datasets. The team now wants the analyst to read only tables whose names start with the prefix sales_ in one specific dataset, without creating a new custom role and without changing the analyst's existing project-level grants. Which approach achieves this most precisely?
A. Add a deny policy at the project level that denies BigQuery read permissions on tables whose name does not start with sales_, attached to the analyst's principal.
B. Remove the project-level BigQuery Data Viewer grant and instead grant BigQuery Data Viewer on every individual table whose name starts with sales_ in the target dataset.
C. Add an IAM condition to the analyst's BigQuery Data Viewer binding that uses resource.name.startsWith with the table path prefix for sales_ tables in the target dataset. (Correct)
D. Create an authorised view in a separate dataset that selects from the sales_ tables, and grant the analyst BigQuery Data Viewer on that dataset only.
Use IAM conditions with resource attribute expressions to scope role bindings to a subset of resources without creating a custom role. IAM conditions let you attach a CEL expression to an existing role binding. For BigQuery tables, resource.name.startsWith on the full table path is the supported attribute for prefix matching, so the analyst's Data Viewer role becomes effective only on tables whose path begins with the sales_ prefix, preserving the rest of the project-level grant unchanged.
Why A is wrong: Deny policies can restrict permissions but cannot match BigQuery table names with a startsWith expression on a resource attribute, so the negation cannot be authored cleanly and would block far more than the intended tables.
Why B is wrong: Per-table grants would work but the requirement is to leave existing project-level grants in place, and managing one binding per table does not scale as new sales_ tables are created over time.
Why C is correct: IAM conditions on a role binding evaluate CEL expressions against request and resource attributes, and resource.name.startsWith on the BigQuery table path is the documented pattern for restricting access to tables matching a name prefix.
Why D is wrong: Authorised views are useful for column or row filtering but they require maintaining one view per table or a union view, and they do not transparently expose the underlying sales_ tables to ad hoc queries by name.
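To make the winning option concrete, here is a minimal sketch of what a conditional role binding could look like when built as plain data. The helper name, the project, dataset, and member values are hypothetical, and the table path format (projects/.../datasets/.../tables/...) is an assumption to check against the current BigQuery IAM documentation before relying on it.

```python
import json


def prefix_condition_binding(member: str, project: str, dataset: str, table_prefix: str) -> dict:
    """Build an IAM binding dict whose CEL condition matches a table-name prefix.

    The resource path layout below is an assumption for illustration.
    """
    resource_prefix = f"projects/{project}/datasets/{dataset}/tables/{table_prefix}"
    return {
        "role": "roles/bigquery.dataViewer",
        "members": [member],
        "condition": {
            "title": "sales tables only",
            "description": f"Limit access to tables starting with {table_prefix}",
            # CEL expression evaluated against the resource being accessed
            "expression": f'resource.name.startsWith("{resource_prefix}")',
        },
    }


binding = prefix_condition_binding(
    member="user:analyst@example.com",   # hypothetical principal
    project="my-project",                # hypothetical project ID
    dataset="retail",                    # hypothetical dataset
    table_prefix="sales_",
)
print(json.dumps(binding, indent=2))
```

The point to take into the exam is the shape: same role, same member, plus a condition that narrows where the role applies.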
Ingesting and Processing Data
What you must be able to do. Given a data flow with stated volume, latency, and code constraints, choose the ingestion service, the processing engine, and the orchestration tool that fit, and handle late or out-of-order data correctly.
In one sentence. The heaviest domain: moving data in and transforming it, and choosing correctly between Pub/Sub, Dataflow, Dataproc, and Composer for the job in front of you.
Recall check: answer these from memory first
Name the role of Pub/Sub, Dataflow, Dataproc, and Composer in a streaming pipeline, in one line each.
Which Dataflow windowing strategy groups events by periods of user inactivity, and what decides when any window is closed?
A team has existing Apache Spark jobs to move to Google Cloud with minimal rewriting. Which processing service, and why not Dataflow?
What it tests. The core of the exam, building the pipeline. Planning pipelines (sources and sinks, transformation and orchestration logic, networking, encryption); building them with Dataflow and Apache Beam, Dataproc, Cloud Data Fusion, BigQuery, Pub/Sub, and the open-source stack of Kafka, Spark, and Hadoop, including batch and streaming transforms with windowing and late-arriving data; applying AI and ML for enrichment during ingestion with Vertex AI; and deploying and operationalising pipelines with Cloud Composer and Workflows, plus CI/CD for data pipelines.
How to study it. This is the biggest domain by weight, so spend the most time here, and most of it is one decision: which service does each stage. Fix the four-way split until it is automatic: Pub/Sub ingests and buffers events, Dataflow transforms streaming and batch data with one Beam model, Dataproc runs existing Spark and Hadoop jobs, Composer orchestrates and schedules the steps. Learn Dataflow windowing as the answer to late and out-of-order data: fixed, sliding, and session windows, with watermarks deciding when a window closes. Know when Cloud Data Fusion (visual, no-code ETL) beats writing Beam, and when Dataproc (lift-and-shift open source) beats Dataflow. The exam rewards the most managed option unless the scenario names an existing open-source estate.
Easy to confuse
Dataflow versus Dataproc. Dataflow is the serverless, fully managed choice for new pipelines on the Apache Beam model; Dataproc is managed Spark and Hadoop for lifting and shifting an existing open-source estate. If the scenario names existing Spark or Hadoop code, it is Dataproc; otherwise Dataflow is the default.
Pub/Sub versus Dataflow. Pub/Sub ingests, buffers, and delivers messages but does not transform them; Dataflow reads from Pub/Sub and does the actual processing, windowing, and aggregation. One is the pipe, the other is the engine.
Cloud Composer versus Workflows. Cloud Composer (managed Apache Airflow) orchestrates complex, scheduled data pipelines with dependencies and is the heavier, richer tool; Workflows is lightweight serverless orchestration for chaining service calls and APIs. Composer for data DAGs, Workflows for simpler service-to-service sequencing.
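The "data DAG" idea behind Composer is simply a set of tasks with dependencies that must run in a valid order. The sketch below uses Python's standard graphlib module, not the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
pipeline = {
    "ingest_events": set(),
    "transform_events": {"ingest_events"},
    "load_warehouse": {"transform_events"},
    "refresh_dashboard": {"load_warehouse"},
    "export_audit_log": {"ingest_events"},
}


def run_order(dag: dict) -> list:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)  # one valid order; independent tasks may appear in either position
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is what separates the heavier tool from simple service-to-service chaining.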
Worked example from the PDE bank
Free sample. Domain: Ingesting and Processing Data. Difficulty: hard.
A retail analytics team ingests clickstream events into Pub/Sub and processes them with a Dataflow streaming pipeline that aggregates page views per user session, where a session ends after 20 minutes of inactivity. Sessions can run from seconds to several hours, and late events arrive up to 10 minutes behind the watermark. Which windowing strategy in Apache Beam should the team apply to compute one aggregate per logical session per user?
A. Apply session windows with a gap duration of 20 minutes keyed by user, with allowed lateness of 10 minutes. (Correct)
B. Apply fixed windows of 20 minutes keyed by user, with allowed lateness of 10 minutes and accumulating panes.
C. Apply sliding windows of 20 minutes with a 1-minute period, with allowed lateness of 10 minutes.
D. Apply global windows with an early trigger every 20 minutes, with allowed lateness of 10 minutes.
Choose Beam session windows when boundaries are defined by gaps in user activity rather than wall-clock intervals. Session windows in Apache Beam are dynamically created per key based on the inactivity gap between event timestamps. When a new event arrives within the gap of an existing window for that key, the window is extended, otherwise a new session is opened. This matches the requirement of one aggregate per logical user session of any duration, and allowed lateness keeps the window state alive long enough to absorb late events.
Why A is correct: Session windows are data-driven and group events for the same key whenever the gap between successive event timestamps is below the configured duration, producing exactly one window per logical session of activity.
Why B is wrong: Fixed windows split a long browsing session into arbitrary 20-minute buckets aligned to wall-clock time, so a single user session straddling a boundary is reported as two aggregates rather than one logical session.
Why C is wrong: Sliding windows emit overlapping aggregates and produce many panes per user, which is appropriate for moving averages but not for a one-aggregate-per-session contract because each event belongs to multiple windows.
Why D is wrong: The global window groups all events into a single window per key and relies on triggers for emission, which cannot express the inactivity-gap semantics of a session and would mix events from unrelated sessions.
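The difference between session and fixed windows is easier to see on a small timeline. The following is a plain-Python illustration of the semantics only, not the Apache Beam API; the function names and sample timestamps (in seconds) are made up for the example.

```python
from collections import defaultdict


def session_windows(events, gap_seconds):
    """Group (user, timestamp) events into per-user sessions.

    A new session starts when the gap since that user's previous event
    is at least gap_seconds. Returns {user: [(start, end, count), ...]}.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()                      # order by event time, not arrival order
        windows = []
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev < gap_seconds:    # still inside the inactivity gap
                prev = ts
                count += 1
            else:                          # gap exceeded: close and start anew
                windows.append((start, prev, count))
                start = prev = ts
                count = 1
        windows.append((start, prev, count))
        sessions[user] = windows
    return sessions


def fixed_window_count(stamps, size_seconds):
    """Number of distinct fixed windows the timestamps fall into."""
    return len({ts // size_seconds for ts in stamps})


GAP = 20 * 60  # 20 minutes
events = [("u1", 0), ("u1", 1500), ("u1", 600), ("u1", 2600), ("u1", 5000), ("u2", 100)]

print(session_windows(events, GAP))
# u1 -> [(0, 2600, 4), (5000, 5000, 1)], u2 -> [(100, 100, 1)]
print(fixed_window_count([0, 600, 1500, 2600], GAP))
# 3: the same continuous session is split across three fixed buckets
```

The first four u1 events form one logical session under the gap rule but land in three separate 20-minute buckets, which is exactly why option B fails.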
Storing Data
What you must be able to do. Given an access pattern with stated latency, consistency, scale, and query needs, select the storage service and the schema design that fit, rather than the one you know best.
In one sentence. The storage-selection domain: reading the access pattern and choosing the right database or warehouse, then modelling it well.
Recall check: answer these from memory first
Match each to its storage service: analytical SQL over petabytes; low-latency time-series writes at scale; relational with strong global consistency across regions; a sub-millisecond cache.
When do you partition a BigQuery table and when do you cluster it, and which constraint does each one reduce?
Cloud SQL or Spanner for an application that must stay strongly consistent while scaling writes globally, and why is the other one wrong?
What it tests. Matching storage to access pattern and modelling it correctly. Selecting among BigQuery, BigLake, AlloyDB, Bigtable, Spanner, Cloud SQL, Cloud Storage, Firestore, and Memorystore by analysing how the data is read and written; planning BigQuery warehouse schemas (data model design, normalisation decisions, partitioning, clustering); managing data lakes on Cloud Storage and Dataplex (discovery, access control, cost control, monitoring); and designing data platforms with Dataplex, BigLake, and federated governance for distributed systems.
How to study it. Build one decision tree for the storage choice and drill it, because this domain is almost entirely selection. Start from the access pattern: analytical SQL over huge tables is BigQuery; high-throughput, low-latency key reads and writes (time series, IoT) are Bigtable; a regional relational database is Cloud SQL; relational with horizontal scale and strong global consistency is Spanner; document data for mobile and web with offline sync is Firestore; a sub-millisecond cache is Memorystore. Then learn BigQuery modelling as a second decision: partition on a date, timestamp, or integer-range column so queries skip whole partitions, then cluster on the columns you filter and group by to read less within each partition. Know that partitioning and clustering reduce cost by reading less data, which is exactly the constraint the exam names.
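One way to drill the storage decision tree is to write it down as code. This is a study aid under stated assumptions, not an official selection algorithm: the Workload fields and the order of the checks are hypothetical simplifications of the rules described above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Flags describing the access pattern named in a scenario (illustrative)."""
    analytical_sql: bool = False
    high_throughput_key_access: bool = False
    relational: bool = False
    horizontal_or_global_scale: bool = False
    documents_with_offline_sync: bool = False
    sub_millisecond_cache: bool = False


def choose_storage(w: Workload) -> str:
    """Map an access pattern to the storage service built for it."""
    if w.sub_millisecond_cache:
        return "Memorystore"
    if w.analytical_sql:
        return "BigQuery"
    if w.high_throughput_key_access:
        return "Bigtable"
    if w.relational:
        # Spanner only when the scenario names scale beyond one primary
        return "Spanner" if w.horizontal_or_global_scale else "Cloud SQL"
    if w.documents_with_offline_sync:
        return "Firestore"
    return "unclear: re-read the scenario for its access pattern"


print(choose_storage(Workload(relational=True)))                                   # Cloud SQL
print(choose_storage(Workload(relational=True, horizontal_or_global_scale=True)))  # Spanner
print(choose_storage(Workload(high_throughput_key_access=True)))                   # Bigtable
```

Writing your own version, and arguing about the order of the checks, is a useful way to find the rules you have not yet internalised.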
Easy to confuse
Bigtable versus BigQuery. Bigtable is a NoSQL store for high-throughput, low-latency reads and writes by key, such as time series and IoT; BigQuery is a serverless warehouse for analytical SQL and aggregation over large tables. If the scenario wants fast single-key access at scale it is Bigtable; if it wants ad hoc analytics it is BigQuery.
Cloud SQL versus Spanner. Cloud SQL is managed MySQL, PostgreSQL, or SQL Server for regional workloads that fit on one primary; Spanner adds horizontal scaling with strong consistency, including across regions. Choose Spanner when the scenario needs write throughput or scale beyond a single primary, or strong consistency across regions; otherwise Cloud SQL is cheaper and simpler.
Partitioning versus clustering in BigQuery. Partitioning splits a table into segments by a date or integer column so queries scan only the relevant range; clustering sorts data within each partition by chosen columns so filters and aggregations read less. Partition first by time, then cluster by the columns you filter on.
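The effect of partitioning and clustering on data read can be sketched with a toy in-memory table. This models rows touched rather than bytes billed, and BigQuery's real storage works on blocks, so treat it as an intuition aid only; the function and column names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_table(rows, partition_col, cluster_col):
    """Group rows by partition value and keep each partition sorted by cluster_col."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_col]].append(row)
    for part in partitions.values():
        part.sort(key=lambda r: r[cluster_col])
    return partitions


def rows_read(partitions, cluster_col, partition_value=None, cluster_value=None):
    """Count the rows a query would need to touch under the given filters."""
    if partition_value is None:
        candidates = list(partitions.values())            # no pruning: all partitions
    else:
        candidates = [partitions.get(partition_value, [])]  # pruned to one partition
    total = 0
    for part in candidates:
        if cluster_value is None:
            total += len(part)
        else:
            # Sorted data lets the reader jump straight to the matching run
            keys = [r[cluster_col] for r in part]
            total += bisect_right(keys, cluster_value) - bisect_left(keys, cluster_value)
    return total


# 30 daily partitions with 100 customers each
rows = [{"day": d, "customer": c} for d in range(1, 31) for c in range(100)]
table = build_table(rows, "day", "customer")

print(rows_read(table, "customer"))                                       # 3000
print(rows_read(table, "customer", partition_value=7))                    # 100
print(rows_read(table, "customer", partition_value=7, cluster_value=42))  # 1
```

The partition filter removes 29 of 30 partitions before any data is read; the cluster filter then narrows the read inside the one partition that remains.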
Worked example from the PDE bank
Free sample. Domain: Storing Data. Difficulty: hard.
An architect is comparing BigQuery and Bigtable for a workload that records device telemetry from two million industrial sensors. Each sensor emits a reading every second, and downstream applications need to retrieve the most recent 24 hours of readings for any single sensor within tens of milliseconds, while a separate weekly analytical job scans aggregates across the full fleet. The architect wants to understand the fundamental role boundary between the two services. Which statement most accurately describes how BigQuery and Bigtable differ for this workload?
A. Bigtable is a wide-column NoSQL store with sorted row keys that gives single-digit millisecond reads for a known key, while BigQuery is a columnar analytical warehouse designed for high-throughput scans across very large tables; the per-sensor lookup belongs in Bigtable and the weekly aggregate belongs in BigQuery. (Correct)
B. BigQuery is a wide-column NoSQL store optimised for single-row lookups by key, while Bigtable is a columnar analytical warehouse tuned for ad hoc SQL scans, so the per-sensor lookups should target BigQuery and the weekly aggregate should target Bigtable.
C. BigQuery and Bigtable both target operational workloads, but BigQuery is preferred whenever rows exceed one kilobyte and Bigtable is preferred whenever rows are smaller, regardless of access pattern.
D. Bigtable and BigQuery are interchangeable for telemetry because both are columnar; the team should pick the cheaper one for the region and accept identical latency characteristics from each service.
Distinguish Bigtable as a low-latency wide-column NoSQL store from BigQuery as a columnar analytical warehouse when serving telemetry. Bigtable stores rows sorted by a single row key and is engineered for low-latency point and small range reads at very high write rates, which is exactly the per-sensor recent-history pattern. BigQuery stores data in columnar format across distributed storage and uses a slot-based execution engine that excels at scanning and aggregating across large tables, which is the weekly fleet-wide pattern. Choosing each service for the access pattern it was built for is the canonical PDE role boundary.
Why A is correct: Bigtable is sorted by row key and serves point and small range reads in low single-digit milliseconds, which suits the per-sensor 24-hour lookup, while BigQuery's columnar storage and slot-based execution are designed to scan and aggregate across very large tables on schedule, which suits the weekly cross-fleet job.
Why B is wrong: This reverses the actual roles. BigQuery is the columnar analytical warehouse and Bigtable is the wide-column key-ordered NoSQL store, so the description swaps the two services. A candidate who only half-remembers the column orientation of BigQuery can fall into this trap.
Why C is wrong: BigQuery is an analytical warehouse, not an operational store, and the selection between Bigtable and BigQuery is driven by access pattern rather than row size. The size-based rule sounds concrete but is fabricated and will mislead a candidate who has not internalised the role boundary.
Why D is wrong: The two services are not both columnar. Bigtable is a wide-column store that keeps rows sorted by key, while BigQuery uses columnar storage built for scans, so their access patterns and latency profiles are very different. Bigtable serves low-latency keyed reads while BigQuery serves throughput-oriented scans, so they are not interchangeable for a real-time per-sensor lookup.
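Why sorted row keys make the per-sensor lookup fast can be shown with a sorted list standing in for the table. This is a plain-Python sketch, not the Bigtable client API; the key layout (sensor ID, then a reversed timestamp so the newest reading sorts first) is a commonly used pattern assumed here for illustration, and MAX_TS is a hypothetical bound.

```python
from bisect import bisect_left, bisect_right

MAX_TS = 9_999_999_999  # hypothetical upper bound used to reverse timestamps


def row_key(sensor_id: str, ts: int) -> str:
    """Sensor ID, then a zero-padded reversed timestamp so newest sorts first."""
    return f"{sensor_id}#{MAX_TS - ts:010d}"


def recent_readings(sorted_keys, sensor_id, since_ts, now_ts):
    """Return the contiguous slice of keys for one sensor in [since_ts, now_ts]."""
    lo = bisect_left(sorted_keys, row_key(sensor_id, now_ts))     # newest key
    hi = bisect_right(sorted_keys, row_key(sensor_id, since_ts))  # oldest key
    return sorted_keys[lo:hi]


base = 1_700_000_000  # arbitrary epoch second for the example
keys = sorted(row_key(s, base + i) for s in ("s1", "s2", "s10") for i in range(10))

window = recent_readings(keys, "s1", since_ts=base + 5, now_ts=base + 9)
print(len(window))   # 5
print(window[0])     # the newest s1 reading comes first
```

Because all keys for one sensor sit next to each other, the lookup is two binary searches and one contiguous read, regardless of how many other sensors exist.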
Preparing and Using Data for Analysis
What you must be able to do. Given an analytics, ML, or sharing requirement, choose the technique that delivers it at the lowest cost and latency, from materialised views to BigQuery ML to Analytics Hub.
In one sentence. The consumption layer: making stored data fast to query, ready for ML, and safe to share.
Recall check: answer these from memory first
When do you use BI Engine and when do you use a materialised view to speed up BigQuery, and what does each one actually do?
Why does BigQuery ML suit a team whose data is already in BigQuery, and what does it save you doing?
Which sharing mechanism lets two organisations analyse combined data without either seeing the other's raw rows?
What it tests. Turning stored data into analysis, models, and shared products. Preparing data for visualisation (BigQuery BI Engine for in-memory acceleration, materialised views to precompute, troubleshooting query performance, masking with IAM and Cloud DLP); preparing data for AI and ML (feature engineering, training and serving with BigQuery ML, preparing unstructured data for embeddings and retrieval-augmented generation); and sharing data (sharing rules, publishing datasets, Analytics Hub, and data clean rooms for privacy-safe collaboration).
How to study it. Learn each acceleration and sharing technique as the answer to a named constraint. Use BI Engine when a dashboard needs sub-second response over BigQuery; use a materialised view when the same expensive aggregation is queried repeatedly, because it precomputes and refreshes automatically. For ML, know that BigQuery ML lets you train and predict in SQL without moving data, which the exam prefers when the data already lives in BigQuery and the team works in SQL; embeddings and retrieval-augmented generation are the path for unstructured text. For sharing, learn the ladder: dataset access for simple grants, Analytics Hub for publishing curated data to many consumers, and data clean rooms when two parties must analyse combined data without exposing the raw rows.
Easy to confuse
BI Engine versus materialised view. BI Engine is an in-memory layer that accelerates queries and dashboards without changing them; a materialised view precomputes and stores the result of a specific expensive query and refreshes it incrementally. BI Engine speeds many queries broadly; a materialised view targets one repeated aggregation.
BigQuery ML versus Vertex AI. BigQuery ML trains and serves models in SQL on data already in BigQuery, ideal for analysts and in-warehouse workflows; Vertex AI is the full platform for custom models, deep learning, and MLOps. Pick BigQuery ML for SQL-first, in-place modelling; Vertex AI when the scenario needs custom training or serving beyond SQL.
Analytics Hub versus data clean room. Analytics Hub publishes and subscribes to shared datasets across organisations; a data clean room lets parties run analysis on combined data while keeping the underlying rows private. Analytics Hub shares data you are willing to expose; a clean room enables joint analysis on data you are not.
Worked example from the PDE bank
Free sample. Domain: Preparing and Using Data for Analysis. Difficulty: medium.
An analytics team has enabled a BigQuery BI Engine reservation in the same region as a Looker dashboard that queries a 40 GB fact table with several aggregations per panel. The team wants to understand exactly which queries the reservation will accelerate so they can size it correctly. Which statement best describes how BI Engine acceleration is applied to incoming queries?
A. BI Engine accelerates only queries issued through Looker Studio and ignores queries submitted by other clients such as the BigQuery console or the bq command-line tool.
B. BI Engine accelerates all queries that touch any table referenced by the dashboard regardless of region, because reservations are global resources shared across BigQuery locations.
C. BI Engine accelerates eligible SQL queries against tables in the reserved project and region by serving them from an in-memory cache, falling back to standard BigQuery slots for unsupported features. (Correct)
D. BI Engine accelerates queries by precomputing aggregations into a materialised view that is automatically registered in the reservation and refreshed on every base table change.
Recognise that BI Engine is a regional in-memory acceleration layer that transparently serves eligible BigQuery SQL and falls back to slots otherwise. BI Engine reservations are scoped to a project and region. When a query runs, BigQuery checks whether the referenced data fits the reservation and whether the query uses BI Engine supported SQL features. Eligible work is served from the in-memory cache, while unsupported operators or excess data fall back to standard slot execution. This client-agnostic, partial-acceleration behaviour is central to sizing decisions.
Why A is wrong: It is tempting because BI Engine was originally promoted as a Looker Studio accelerator. In practice acceleration is client-agnostic and applies to any SQL query that fits within the reservation's supported feature set, including queries from the console, bq, drivers, and Looker.
Why B is wrong: Region matching trips up many candidates. BI Engine reservations are regional, and a reservation only accelerates queries that run in the same location as the reserved data. Cross-region queries cannot be served from the cache.
Why C is correct: BI Engine maintains an in-memory representation of frequently accessed data and rewrites supported query patterns to read from that cache. Queries or query fragments that use unsupported SQL features run on standard slots, so partial acceleration is possible.
Why D is wrong: This blurs BI Engine with materialised views. BI Engine is an in-memory caching layer, not a precomputation engine, and it does not create or own materialised views on the user's behalf.
Maintaining and Automating Data Workloads
What you must be able to do. Given a running workload with cost, reliability, or capacity pressure, choose the optimisation, automation, monitoring, or failover approach that meets the constraint at the least overhead.
In one sentence. The operations layer: keeping workloads cheap, automated, observable, and resilient once they are live.
Recall check: answer these from memory first
When do you choose BigQuery on-demand pricing and when do you choose Editions with slot reservations, and which constraint pushes you to reservations?
Persistent or ephemeral Dataproc cluster for intermittent nightly batch jobs, and why does that choice save money?
Where do you look first for a slow or expensive BigQuery query: Cloud Monitoring, Cloud Logging, or INFORMATION_SCHEMA, and what does each one tell you?
What it tests. Running data workloads economically and reliably. Optimising resources to minimise cost (BigQuery Editions and reservations, Dataproc autoscaling, persistent versus job-scoped clusters); automation and repeatability with Cloud Composer DAGs; organising workloads with BigQuery Editions and slot reservations and classifying jobs as interactive or batch; monitoring and troubleshooting with Cloud Monitoring, Cloud Logging, the BigQuery admin panel, and INFORMATION_SCHEMA, including quota management; and mitigating failures through fault-tolerant design, multi-region and multi-zone deployment, and replication and failover for Cloud SQL and Redis.
How to study it. Learn the cost and capacity levers as decisions driven by the workload pattern. For BigQuery, on-demand pricing charges per byte scanned and suits spiky or low-volume use; Editions with slot reservations give predictable capacity and cost for steady, heavy use, which is the constraint the exam names when it mentions budgets or predictable spend. For Dataproc, choose a persistent cluster for constant work and a job-scoped (ephemeral) cluster that spins up and tears down per job to cut cost for intermittent batch. Learn where each signal lives: Cloud Monitoring for metrics and alerts, Cloud Logging for logs, INFORMATION_SCHEMA and the admin panel for query and slot analysis. For resilience, match the design to the recovery requirement: multi-zone for high availability within a region, multi-region for surviving a regional outage.
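The on-demand versus reservation decision comes down to a break-even calculation. The sketch below shows the arithmetic only: the prices are round placeholder numbers chosen for easy mental checking, not quoted Google Cloud prices, and the function names are hypothetical.

```python
def on_demand_cost(tib_scanned: float, price_per_tib: float) -> float:
    """Cost when billed per TiB scanned."""
    return tib_scanned * price_per_tib


def reservation_cost(slots: int, hours: float, price_per_slot_hour: float) -> float:
    """Cost when paying for a fixed number of slots over a period."""
    return slots * hours * price_per_slot_hour


def break_even_tib(slots: int, hours: float, price_per_slot_hour: float, price_per_tib: float) -> float:
    """TiB scanned per period above which the reservation is the cheaper option."""
    return reservation_cost(slots, hours, price_per_slot_hour) / price_per_tib


# Placeholder prices for illustration only; check current pricing before use.
PRICE_PER_TIB = 5.00
PRICE_PER_SLOT_HOUR = 0.05
HOURS_PER_MONTH = 730

threshold = break_even_tib(100, HOURS_PER_MONTH, PRICE_PER_SLOT_HOUR, PRICE_PER_TIB)
print(f"Break-even at about {threshold:.0f} TiB scanned per month")  # about 730
```

Below the threshold a spiky or light workload is cheaper on demand; above it, steady heavy scanning favours reserved slots, and the reservation also makes the monthly figure predictable.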
Easy to confuse
BigQuery on-demand versus Editions (slot reservations). On-demand charges per byte scanned and needs no commitment, best for unpredictable or low-volume querying; Editions reserve slots for predictable capacity and cost, best for steady heavy workloads. The deciding constraint is whether the workload is spiky or steady and whether spend must be predictable.
Persistent versus ephemeral Dataproc cluster. A persistent cluster stays running for constant or interactive work; an ephemeral, job-scoped cluster is created for one job and deleted afterwards, so no cluster is billed between jobs. Choose ephemeral for intermittent batch to avoid paying for idle capacity.
Multi-zone versus multi-region deployment. Multi-zone spreads resources across zones in one region to survive a zone failure; multi-region spreads across regions to survive a whole-region outage at higher cost and latency. Match the choice to whether the scenario must survive a zone failure or a regional one.
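The saving from an ephemeral Dataproc cluster is the idle time you stop paying for. A rough sketch of that arithmetic follows; the hourly rate, the start-up overhead, and the function name are placeholder assumptions, not real pricing.

```python
def monthly_cluster_cost(
    hourly_rate: float,
    job_hours_per_day: float,
    days: int = 30,
    persistent: bool = True,
    startup_overhead_hours: float = 0.0,
) -> float:
    """Estimate monthly cluster cost for a daily batch job.

    A persistent cluster bills around the clock; an ephemeral cluster
    bills only for the job plus its create/delete overhead.
    """
    if persistent:
        billed_hours = 24 * days
    else:
        billed_hours = (job_hours_per_day + startup_overhead_hours) * days
    return hourly_rate * billed_hours


RATE = 10.0  # placeholder cost per cluster-hour, for illustration only

always_on = monthly_cluster_cost(RATE, job_hours_per_day=2, persistent=True)
per_job = monthly_cluster_cost(RATE, job_hours_per_day=2, persistent=False,
                               startup_overhead_hours=0.25)
print(always_on, per_job)  # 7200.0 675.0
```

For a two-hour nightly job the always-on cluster spends most of its bill on idle hours, which is the constraint the exam is naming when it says "intermittent".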
Worked example from the PDE bank
Free sample. Domain: Maintaining and Automating Data Workloads. Difficulty: medium.
A retail analytics group has migrated from BigQuery on-demand pricing to BigQuery Editions and now runs all interactive workloads against a single Enterprise edition reservation with autoscaling enabled. They observe that, with baseline slots set to zero, small ad hoc queries from analysts often wait several seconds before any slots are allocated. Which statement best describes how the baseline and maximum slot settings on a reservation affect this behaviour?
A. Baseline slots and autoscaler slots are both provisioned on demand, so any query against an Enterprise edition reservation incurs the same scale-up delay regardless of the baseline value.
B. Baseline slots are always available without scale-up latency, while autoscaler slots above the baseline are provisioned on demand and can take a short time to spin up before they become billable. (Correct)
C. Baseline slots define the maximum the reservation can ever use, and the autoscaler simply rebalances those slots between queries when contention is detected by the scheduler.
D. Baseline slots are billed only when they are actively used by a query, while autoscaler slots are billed for the full reservation window once any query triggers scale-up activity.
Explain how baseline and autoscaler slot settings in a BigQuery Editions reservation affect query start latency and billing. A BigQuery Editions reservation keeps the baseline number of slots permanently assigned to the reservation, so queries can use them with no scale-up delay. When demand exceeds the baseline, the autoscaler adds slots in increments up to the configured maximum. These autoscaler slots take a short time to provision and are billed per second only while they are active, which is why analysts see a small wait when the baseline is zero.
Why A is wrong: Tempting because reservations feel elastic end to end, but it is wrong because baseline capacity is held continuously and is available without scale-up; only the autoscaler portion is provisioned on demand.
Why B is correct: Correct. The baseline is the floor that is reserved continuously, so queries using only baseline capacity start immediately, while slots above the baseline are added by the autoscaler in increments and incur a brief provisioning delay before they begin charging.
Why C is wrong: Tempting because the baseline does set a floor, but it does not cap the reservation. The maximum reservation size is a separate setting, and the autoscaler adds slots above the baseline rather than just rebalancing fixed capacity.
Why D is wrong: Tempting because it sounds like a usage-based model, but it inverts the billing. Baseline slots are billed continuously while reserved, and autoscaler slots are billed per second they are active, not for a full window.
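The behaviour in this worked example can be modelled in a few lines. This is a simplified sketch, not the BigQuery scheduler: the function name, the 50-slot increment, and the treatment of `max_slots` as a total cap that includes the baseline are all assumptions made for illustration.

```python
import math


def slots_for_demand(demand: int, baseline: int, max_slots: int,
                     increment: int = 50) -> dict:
    """Split slot demand into baseline (instant) and autoscaled (delayed) capacity."""
    # Baseline slots are always held, so this part is available with no delay.
    immediate = min(demand, baseline)
    overflow = max(0, demand - baseline)
    # Assumed model: the autoscaler adds whole increments, capped at max_slots in total.
    headroom = max(0, max_slots - baseline)
    autoscaled = min(math.ceil(overflow / increment) * increment, headroom)
    return {
        "immediate": immediate,
        "autoscaled": autoscaled,
        "waits_for_scale_up": autoscaled > 0,
    }


# Baseline of zero: even a small query has to wait for autoscaled slots.
print(slots_for_demand(demand=120, baseline=0, max_slots=400))
# Baseline of 100: the same kind of small query starts on held capacity.
print(slots_for_demand(demand=80, baseline=100, max_slots=400))
```

Under this model a baseline of zero means every query falls into the autoscaled branch, which matches the wait the analysts see in the scenario.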
A study plan that works
Map the blueprint and book a date
Day 1
Read the official Google Cloud exam guide and the five domains with their weights. Book a provisional date now: a fixed date turns open-ended study into a plan and makes it far more likely that you actually sit the exam. Note that Ingesting and Processing (25 percent) and Designing (22 percent) are nearly half the exam between them.
Build the service-selection map
Week 1
Before drilling any domain, build the two decision trees the whole exam rests on: ingest and process (Pub/Sub, Dataflow, Dataproc, Composer) and storage (BigQuery, Bigtable, Cloud SQL, Spanner, Firestore, Memorystore). Use the recall checks in this guide: cover the answer, choose the service from the constraint, then reveal. If you cannot pick from the access pattern alone, you do not own it yet.
Go deep on ingesting and designing (Domains 2 and 1)
Weeks 1 to 3
These two are nearly half the exam, so they get the most time. Drill the four-way pipeline split and Dataflow windowing for late data, and learn each migration tool by the scenario it fits. Practise on scenario questions and read the worked explanation on every one, including the ones you got right, watching for the named constraint that picks the answer.
Lock storage selection and BigQuery modelling (Domain 3)
Weeks 3 to 4
Storage selection is reliable marks if you drill it as a decision tree from the access pattern. Add BigQuery modelling: partition by time to prune ranges, cluster by filtered columns to read less. Do the Bigtable-versus-BigQuery and Cloud SQL-versus-Spanner calls by hand until the constraint alone decides them.
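Why time partitioning cuts cost is easier to see with a toy model. The sketch below is an illustration only, not BigQuery itself: it represents a table as a hypothetical mapping from partition date to bytes stored, and the function name is made up for the example.

```python
from datetime import date


def bytes_scanned(partitions: dict, start: date, end: date,
                  partitioned: bool = True) -> int:
    """Return bytes read by a query filtering on dates in [start, end] inclusive."""
    if not partitioned:
        # Without partitioning the date filter cannot skip data: full scan.
        return sum(partitions.values())
    # With partitioning, only partitions inside the filter range are read.
    return sum(size for day, size in partitions.items() if start <= day <= end)


# Thirty daily partitions of 10 bytes each (tiny numbers to keep the sums obvious).
table = {date(2024, 1, day): 10 for day in range(1, 31)}

week = (date(2024, 1, 1), date(2024, 1, 7))
print(bytes_scanned(table, *week))                      # 70
print(bytes_scanned(table, *week, partitioned=False))   # 300
```

Clustering works at a finer grain inside each partition and is not modelled here; the exam-level rule is that partitioning prunes whole date ranges, while clustering reduces what is read for filtered columns.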
Cover analysis and operations (Domains 4 and 5)
Week 4
Analysis rewards knowing when BI Engine, materialised views, and BigQuery ML each apply; operations rewards the cost levers (on-demand versus Editions, persistent versus ephemeral clusters) and the resilience choices. Both are learnable and tie straight back to cost and latency constraints, so they are dependable marks.
Drill weak domains, then space the review
Week 5
Use your per-domain accuracy to attack the two domains dragging you down, not to re-read what you already know. Then space it: revisit each domain's recall prompts after a few days and again a week later. Spaced review sticks far better than cramming the same material in one sitting.
Sit a timed mock and calibrate
Weeks 5 to 6
Take at least one full timed mock under exam conditions to rehearse pacing and the flag-and-return habit across 40 to 50 questions in 120 minutes. Treat the score as a per-domain readiness signal, not a single number, and review every missed question, naming the constraint you misread, before you book or sit.
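The pacing budget for a mock is quick to work out. This is a small sketch using the figures quoted in this guide (40 to 50 questions, 120 minutes); the function name and the optional review reserve are assumptions added for illustration.

```python
def minutes_per_question(total_minutes: int, questions: int,
                         review_reserve: int = 0) -> float:
    """Average minutes available per question after holding back review time."""
    return (total_minutes - review_reserve) / questions


# Range implied by 40 to 50 questions in 120 minutes.
print(minutes_per_question(120, 40))  # 3.0
print(minutes_per_question(120, 50))  # 2.4

# Holding back a hypothetical 15 minutes to revisit flagged questions.
print(minutes_per_question(120, 50, review_reserve=15))
```

Planning for the 50-question case is the safer assumption, since it gives the tighter per-question budget.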
Know when you're ready
Readiness for the Google Cloud Professional Data Engineer is a score on scenario questions you have not seen before, not a feeling that the services are familiar. Those are different things, and the gap between them is where people fail. Re-reading the docs builds fluency, and fluency feels like knowledge, so confidence rises while real recall does not. The fix is to test yourself: if you can read a fresh scenario, name the constraint, and pick the right service while explaining why each other option is wrong, you know it; if you can only nod along to an explanation, you do not yet.
Be especially wary of early confidence on the service map. Knowing what BigQuery, Bigtable, Dataflow, and Dataproc each do is the easy half; choosing between them under a cost or latency constraint, when two of them would work, is the half the exam actually tests. Trust your measured per-domain accuracy over your gut, and set the bar at clearing every domain comfortably on unseen questions across more than one session, not scraping a single pass.
This guide gives you the map. The practice bank is where you find out whether you can navigate it, with a worked explanation and a reason every distractor is wrong on every question. Readiness scoring tells you when you are there. Not before.
Read the scenario for its constraint first. The cost, latency, scale, or operational-overhead limit named in the question is what picks the answer, so find it before you judge the options.
When two services both work, default to the most managed one. Google prefers serverless and fully managed; reach for the less managed option only when the scenario names a reason, such as existing Spark or Hadoop code.
Treat an existing open-source estate as a signal. Words like existing Spark, Hadoop, or Kafka usually point to Dataproc or a lift-and-shift answer over the otherwise obvious Dataflow or Pub/Sub choice.
Let the access pattern pick the storage. Analytical SQL means BigQuery, low-latency key reads mean Bigtable, regional relational means Cloud SQL, global strong consistency means Spanner; do not default to the one you know best.
Watch for the cost trap. When a question stresses minimising cost or predictable spend, the answer is usually the lever built for it: partitioning and clustering, on-demand versus Editions, or an ephemeral cluster over a persistent one.
Flag and move on. Cover every question once before you spend time on a hard one; with 40 to 50 questions in 120 minutes, collecting the clear marks first protects the ones you actually know.
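The storage rule in the tips above can be written down as a lookup, which is a handy way to drill it. This is a study aid under stated assumptions, not an exhaustive decision tree: the pattern keys and the function name are hypothetical labels, and only the four mappings named in this guide are included.

```python
# Mapping taken from the tip above; the key names are illustrative labels.
STORAGE_BY_ACCESS_PATTERN = {
    "analytical_sql": "BigQuery",
    "low_latency_key_reads": "Bigtable",
    "regional_relational": "Cloud SQL",
    "global_strong_consistency": "Spanner",
}


def choose_storage(access_pattern: str) -> str:
    """Return the service this guide associates with the given access pattern."""
    try:
        return STORAGE_BY_ACCESS_PATTERN[access_pattern]
    except KeyError:
        known = ", ".join(sorted(STORAGE_BY_ACCESS_PATTERN))
        raise ValueError(
            f"Unknown access pattern {access_pattern!r}; expected one of: {known}"
        ) from None


print(choose_storage("low_latency_key_reads"))  # Bigtable
```

Raising an error for an unknown pattern is a deliberate choice: if the scenario's access pattern cannot be named, the constraint has not been found yet.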
Frequently asked questions
Is the Google Cloud Professional Data Engineer hard?
It is an advanced, professional-level exam, and the difficulty is judgement rather than recall. Most questions are scenarios where several Google Cloud services could work and only one fits the stated cost, latency, scale, or operational constraint. Scenario practice with worked explanations matters far more than memorising what each service does.
How long should I study for the PDE?
Most candidates with real Google Cloud experience are ready in six to eight weeks of steady study. Less hands-on exposure means more time on the two heavy domains, Ingesting and Processing and Designing, and on the service-selection decisions the whole exam rests on.
What is the pass mark for the PDE?
Google does not publish an official passing score for its professional exams, and the result is reported as pass or fail. Because there is no public percentage to target, aim to clear every domain comfortably on unseen practice questions rather than chasing a raw figure.
Do I need to know how to code for this exam?
You need to read and reason about pipelines, SQL, and the Apache Beam model, but the exam is about choosing and configuring services, not writing programs from scratch. Comfort with SQL and an understanding of how Dataflow, Dataproc, and BigQuery process data is what carries you.
How much does the exam cost and how long is it?
The exam costs 200 USD and runs for 120 minutes, with 40 to 50 multiple-choice and multiple-select questions, as shown in the facts panel above. It is taken online-proctored from your own location or onsite at a test centre.
Which domains should I focus on?
Ingesting and Processing Data at 25 percent and Designing Data Processing Systems at 22 percent are nearly half the exam, so they deserve the most time. Storing Data at 20 percent is close behind and rewards a clean storage-selection decision tree, so do not leave it short.
How many practice questions should I do before booking?
Enough that every domain clears comfortably on questions you have not seen, and a full timed mock feels comfortable on pacing. Quality of review beats raw volume: on every question, read the explanation and name the constraint that picked the answer, including on the ones you got right.
Is the Google Cloud Professional Data Engineer worth it?
It is one of the more respected data engineering credentials in the Google Cloud ecosystem, and practitioners who hold it have typically demonstrated they can reason about service selection under real constraints, not just recall what each product does. The preparation is worthwhile beyond the exam itself: working through the distinctions between Dataflow and Dataproc, or between Bigtable and BigQuery, under cost and latency trade-offs tends to sharpen architectural decision-making in ways that transfer directly to real projects. A common next step for those moving further into applied AI is the Google Cloud Professional Machine Learning Engineer certification.
Examworthy is not affiliated with or endorsed by Google Cloud. This guide is original study material based on the public exam blueprint. We never reproduce live exam items. PDE and related marks belong to their respective owners.