Design, build, and manage data processing systems on Google Cloud, with a worked explanation on every practice question.
Free sample questions
No account needed. Every question has a worked explanation, just like the full bank.
Free sample | Ingesting and Processing Data | Hard
A retail analytics team ingests clickstream events into Pub/Sub and processes them with a Dataflow streaming pipeline that aggregates page views per user session, where a session ends after 20 minutes of inactivity. Sessions can run from seconds to several hours, and late events arrive up to 10 minutes behind the watermark. Which windowing strategy in Apache Beam should the team apply to compute one aggregate per logical session per user?
- A. Apply session windows with a gap duration of 20 minutes keyed by user, with allowed lateness of 10 minutes. (Correct)
- B. Apply fixed windows of 20 minutes keyed by user, with allowed lateness of 10 minutes and accumulating panes.
- C. Apply sliding windows of 20 minutes with a 1-minute period, with allowed lateness of 10 minutes.
- D. Apply global windows with an early trigger every 20 minutes, with allowed lateness of 10 minutes.
Choose Beam session windows when boundaries are defined by gaps in user activity rather than by fixed time intervals. Session windows in Apache Beam are created dynamically per key, based on the inactivity gap between event timestamps. When a new event arrives within the gap of an existing window for that key, the window is extended; otherwise a new session is opened. This matches the requirement of one aggregate per logical user session of any duration, and allowed lateness keeps the window state alive long enough to absorb late events.
Why A is correct: Session windows are data-driven and group events for the same key whenever the gap between successive event timestamps is below the configured duration, producing exactly one window per logical session of activity.
Why B is wrong: Fixed windows split a long browsing session into arbitrary 20-minute buckets aligned to fixed event-time boundaries, so a single user session straddling a boundary is reported as two or more aggregates rather than one logical session.
Why C is wrong: Sliding windows emit overlapping aggregates. With a 20-minute window and a 1-minute period, each event belongs to 20 windows, which is appropriate for moving averages but not for a one-aggregate-per-session contract.
Why D is wrong: The global window groups all events into a single window per key and relies on triggers for emission, which cannot express the inactivity-gap semantics of a session and would mix events from unrelated sessions.
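The merging behaviour described above can be sketched in plain Python. This is a batch simulation of the idea, not the Beam API: the `sessionize` function and its tuple layout are hypothetical, and the sketch ignores watermarks and allowed lateness entirely. Each event proposes a window from its timestamp to its timestamp plus the gap, and overlapping windows for the same user are merged.

```python
from collections import defaultdict

GAP_SECONDS = 20 * 60  # inactivity gap taken from the question


def sessionize(events, gap=GAP_SECONDS):
    """Group (user, event_time_seconds) pairs into per-user sessions.

    Returns {user: [(start, end, count), ...]}, where end is the last
    event time in the session plus the gap.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        merged = []
        for ts in sorted(stamps):
            # Each event proposes the window [ts, ts + gap). Merge it into the
            # previous session when the two windows overlap.
            if merged and ts < merged[-1][1]:
                start, _, count = merged[-1]
                merged[-1] = (start, ts + gap, count + 1)
            else:
                merged.append((ts, ts + gap, 1))
        sessions[user] = merged
    return sessions


clicks = [("u1", 0), ("u1", 600), ("u1", 1700), ("u1", 5000), ("u2", 100)]
result = sessionize(clicks)
```

Here the first three `u1` events chain into one session because each arrives before the previous window closes, while the event at 5000 seconds opens a second session. Session length is driven by the data, which is why no fixed or sliding window size can reproduce it.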
Free sample | Ingesting and Processing Data | Hard
A fraud-detection pipeline in Dataflow joins a high-volume stream of card transactions from Pub/Sub against a moderately sized lookup of merchant risk scores that is refreshed in BigQuery every 15 minutes. The lookup fits comfortably in worker memory, and each transaction must be enriched with the latest available risk score for its merchant. Which Apache Beam construct should the team use to perform this enrichment efficiently?
- A. Read the BigQuery lookup as a bounded PCollection and apply CoGroupByKey against the streaming transactions on merchant id.
- B. Model the BigQuery lookup as a periodically refreshed side input and reference it from a ParDo that enriches each transaction. (Correct)
- C. Apply CombinePerKey on the merged stream of transactions and lookup rows to retain the most recent risk score per merchant.
- D. Call the BigQuery Storage Read API from inside a ParDo for every incoming transaction to fetch the latest risk score.
Use a periodically refreshed side input to broadcast a small, slowly changing dimension into a Beam streaming join. Side inputs in Apache Beam are designed for broadcasting auxiliary data to every worker so that a main ParDo can look it up without shuffling the main input. When the auxiliary data changes on a schedule, a periodic side input pattern reissues the read on a fixed cadence and exposes the latest snapshot to the main transform, which is more efficient than CoGroupByKey for small dimensions and avoids per-element remote calls.
Why A is wrong: CoGroupByKey requires both inputs to be keyed PCollections in compatible windows and would shuffle the entire transaction stream by merchant id, which is far heavier than a broadcast lookup and does not match the periodic-refresh semantics of the risk table.
Why B is correct: A periodically refreshed side input materialises the lookup on every worker, refreshes it on the configured cadence, and lets a ParDo read it as an in-memory map, which is the canonical pattern for broadcast joins of streams with slowly changing dimensions.
Why C is wrong: CombinePerKey aggregates values per key but does not produce per-transaction enriched output; it would collapse transactions and risk rows into a single value per merchant and lose every individual transaction record.
Why D is wrong: Per-element remote calls add latency and quota pressure proportional to throughput, and they ignore the fact that the lookup is small and changes only every 15 minutes, making this both slower and more expensive than a side input.
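The difference between a refreshed in-memory snapshot and a per-element remote call can be illustrated with a small plain-Python sketch. It is not Beam code: `RefreshingLookup`, `enrich`, the injected `clock`, and the `loader` callable standing in for the BigQuery read are all hypothetical names chosen for illustration.

```python
class RefreshingLookup:
    """Holds an in-memory snapshot and reloads it on a fixed cadence."""

    def __init__(self, loader, refresh_seconds, clock):
        self._loader = loader              # stand-in for the BigQuery read
        self._refresh_seconds = refresh_seconds
        self._clock = clock                # injected so the sketch is testable
        self._snapshot = {}
        self._loaded_at = None
        self.loads = 0                     # how many times the source was read

    def current(self):
        now = self._clock()
        stale = (
            self._loaded_at is None
            or now - self._loaded_at >= self._refresh_seconds
        )
        if stale:
            self._snapshot = dict(self._loader())
            self._loaded_at = now
            self.loads += 1
        return self._snapshot


def enrich(transaction, lookup, default_score=None):
    """Attach the latest known risk score without a per-element remote call."""
    scores = lookup.current()
    score = scores.get(transaction["merchant_id"], default_score)
    return {**transaction, "risk_score": score}
```

A thousand transactions inside one 15-minute interval trigger a single load, so read volume against the source scales with the refresh cadence rather than with transaction throughput. In a real pipeline the runner, not the enrichment function, would own the refresh and the distribution of the snapshot to workers.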
Free sample | Ingesting and Processing Data | Hard
A logistics team runs a Dataflow streaming job that computes per-minute delivery counts in tumbling windows. Roughly 2 percent of GPS events arrive between 30 seconds and 8 minutes after their event time because of intermittent driver connectivity, and downstream dashboards must reflect corrected counts when late data lands. The job currently uses the default trigger with no allowed lateness, and late events are being dropped. Which configuration best preserves accuracy while keeping per-window state bounded?
- A. Switch to processing-time windows of one minute so events are bucketed on arrival and lateness becomes irrelevant.
- B. Keep the default trigger and set allowed lateness to 10 minutes, accepting that late panes will overwrite earlier results in the sink.
- C. Configure an event-time trigger at the end of the window plus a late-firing trigger after each late element, with allowed lateness of 10 minutes and accumulating panes. (Correct)
- D. Disable the watermark by setting allowed lateness to an unlimited duration and rely on a global trigger to fire once at job drain.
Combine event-time triggers, late-firing triggers, and accumulating panes to handle bounded late data without unbounded state. The watermark estimates the progress of event time and gates the on-time pane for a window. Configuring a late-firing trigger together with bounded allowed lateness keeps per-window state alive only as long as late data may reasonably arrive, while accumulating panes mean each emission represents the full corrected count for the window. This is the textbook pattern for dashboards that must converge to an accurate event-time result.
Why A is wrong: Processing-time windows mis-attribute late events to the window in which they happen to arrive, which corrupts per-minute delivery counts and is the opposite of what the dashboard requires for event-time correctness.
Why B is wrong: Once allowed lateness is set, the default trigger does fire again for late elements, but this option never sets the accumulation mode. With discarding panes, which Beam uses unless accumulating mode is requested, each late pane carries only the late elements, so overwriting the earlier result in the sink would replace a complete count with a partial one.
Why C is correct: An event-time trigger emits an on-time result when the watermark passes the window end, while a late trigger emits an updated pane for each late element within the 10-minute allowance, and accumulating panes ensure each emission represents the corrected cumulative count for the window.
Why D is wrong: Unlimited allowed lateness causes per-window state to grow without bound, and waiting for drain defeats the purpose of streaming dashboards by delaying every result until the job stops.
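The interaction of watermark, allowed lateness, and accumulating panes can be traced with a simplified plain-Python simulation. This is not the Beam API and it glosses over runner details: the `run` function, the `("event", t)` / `("watermark", t)` step format, and the pane tuples are assumptions made for illustration.

```python
WINDOW = 60                 # one-minute fixed windows, in seconds
ALLOWED_LATENESS = 10 * 60  # allowance from the correct option


def run(steps, window=WINDOW, allowed_lateness=ALLOWED_LATENESS):
    """Simulate per-window counts with accumulating panes.

    steps is a list of ("event", event_time) or ("watermark", time) tuples
    in arrival order. Returns (panes, dropped), where each pane is
    (window_start, timing, count).
    """
    counts = {}  # window_start -> running count (the per-window state)
    panes, dropped = [], 0
    watermark = float("-inf")

    for kind, t in steps:
        if kind == "watermark":
            previous, watermark = watermark, max(watermark, t)
            for start in sorted(counts):
                end = start + window
                # On-time pane: the watermark has just passed the window end.
                if previous < end <= watermark:
                    panes.append((start, "on_time", counts[start]))
            # Free state once the lateness allowance has expired.
            expired = [s for s in counts
                       if watermark >= s + window + allowed_lateness]
            for s in expired:
                del counts[s]
        else:
            start = (t // window) * window
            end = start + window
            if watermark >= end + allowed_lateness:
                dropped += 1  # too late: state for this window is gone
                continue
            counts[start] = counts.get(start, 0) + 1
            if watermark >= end:
                # Late firing: an accumulating pane repeats the full count.
                panes.append((start, "late", counts[start]))
    return panes, dropped


panes, dropped = run([
    ("event", 5), ("event", 30), ("watermark", 60),
    ("event", 45), ("watermark", 400), ("event", 10),
    ("watermark", 700), ("event", 20),
])
```

Each late pane in this trace carries the corrected total for the window, so a sink that overwrites by window key converges on the right count. State for the window is released once the watermark passes the window end plus the allowance, which is what keeps it bounded, and the final event arrives after that point and is dropped.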
Frequently asked questions
- How many questions are on the PDE exam?
- The Google Cloud Professional Data Engineer (PDE) exam has 40 to 50 questions and runs for 120 minutes. The format is multiple choice and multiple select, online- or onsite-proctored.
- What score do I need to pass PDE?
- Google Cloud does not publish a fixed pass mark for PDE, so treat any "X%" figure you see elsewhere as unofficial. Examworthy gives you a per-domain readiness score so you can judge when you are ready across every domain.
- How much does the PDE exam cost?
- The exam costs 200 USD to sit. Practising on Examworthy is free to start, with a worked explanation on every question.
- How does Examworthy help me prepare for PDE?
- Every practice question carries a worked explanation and a per-distractor rationale, mapped to the official blueprint domains. You learn why each answer is right or wrong, not just the letter.
- Is Examworthy affiliated with Google Cloud?
- No. Examworthy is not affiliated with or endorsed by Google Cloud. Our questions are original, blueprint-aligned practice material; we never reproduce live exam items.
Examworthy is not affiliated with or endorsed by Google Cloud. All questions are original, blueprint-aligned practice material. We never reproduce live exam items. PDE and related marks belong to their respective owners.