PDE domain - 25% of the exam

Ingesting and Processing Data

Ingesting and Processing Data is 25% of the Google Cloud Professional Data Engineer (PDE) exam. These are the objectives it covers, each with practice questions and worked explanations.

Objectives in this domain

Sample question from this domain

Free sample | Ingesting and Processing Data | Hard

A retail analytics team ingests clickstream events into Pub/Sub and processes them with a Dataflow streaming pipeline that aggregates page views per user session, where a session ends after 20 minutes of inactivity. Sessions can run from seconds to several hours, and late events arrive up to 10 minutes behind the watermark. Which windowing strategy in Apache Beam should the team apply to compute one aggregate per logical session per user?

  • A. Apply session windows with a gap duration of 20 minutes keyed by user, with allowed lateness of 10 minutes. (Correct)
  • B. Apply fixed windows of 20 minutes keyed by user, with allowed lateness of 10 minutes and accumulating panes.
  • C. Apply sliding windows of 20 minutes with a 1-minute period, with allowed lateness of 10 minutes.
  • D. Apply global windows with an early trigger every 20 minutes, with allowed lateness of 10 minutes.

Choose Beam session windows when boundaries are defined by gaps in user activity rather than by fixed time intervals. Session windows in Apache Beam are created dynamically per key, based on the inactivity gap between event timestamps. When a new event arrives within the gap of an existing window for that key, the window is extended; otherwise a new session is opened. This matches the requirement of one aggregate per logical user session of any duration, and an allowed lateness of 10 minutes keeps the window state alive long enough to absorb events that arrive up to 10 minutes behind the watermark.
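The effect of allowed lateness can be illustrated with a deliberately simplified model. This is a plain-Python sketch, not Beam code: the function name `window_accepts` and the exact boundary comparison are assumptions made for illustration, and the real runner's expiry rules have more detail than shown here.

```python
GAP_SECONDS = 20 * 60          # session gap from the question
ALLOWED_LATENESS = 10 * 60     # allowed lateness from the question


def window_accepts(window_end, watermark, allowed_lateness=ALLOWED_LATENESS):
    """Simplified model: a window still accepts data until the watermark
    passes the end of the window plus the allowed lateness."""
    return watermark < window_end + allowed_lateness


# Hypothetical session whose last event has timestamp 1700 s, so the
# session window ends one gap later.
session_end = 1700 + GAP_SECONDS                                # 2900

accepted_late = window_accepts(session_end, watermark=3200)     # within lateness
dropped_late = window_accepts(session_end, watermark=3600)      # past lateness
```

In this model the first late event is still merged into the session, while the second arrives after the window's state would have been discarded.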

Why A is correct: Session windows are data-driven and group events for the same key whenever the gap between successive event timestamps is below the configured duration, producing exactly one window per logical session of activity.
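The gap-based grouping described above can be sketched in plain Python. This is an illustrative simulation only, not the Beam implementation; the helper `sessionize` and the sample events are hypothetical. In the Beam Python SDK the corresponding window function is `apache_beam.transforms.window.Sessions`, applied through `beam.WindowInto`.

```python
from collections import defaultdict

GAP_SECONDS = 20 * 60  # inactivity gap from the question


def sessionize(events, gap=GAP_SECONDS):
    """Group (user, event_time_seconds) pairs into per-user sessions.

    Returns {user: [(first_ts, last_ts, event_count), ...]}.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()  # order by event time, not by arrival order
        merged = []
        for ts in stamps:
            if merged and ts - merged[-1][1] < gap:
                # Gap to the previous event is below the limit: extend the session.
                first, _, count = merged[-1]
                merged[-1] = (first, ts, count + 1)
            else:
                # No open session close enough: start a new one.
                merged.append((ts, ts, 1))
        sessions[user] = merged
    return sessions


events = [
    ("u1", 0), ("u1", 600), ("u1", 1700),  # gaps of 600 s and 1100 s
    ("u1", 3000),                          # 1300 s after 1700: new session
    ("u2", 100),
    ("u1", 1200),                          # arrives out of order, joins the first session
]
result = sessionize(events)
```

Here user `u1` ends up with two sessions, one containing four events (including the out-of-order one) and one containing a single event, which is the one-aggregate-per-session behavior the question asks for.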

Why B is wrong: Fixed windows split a browsing session into 20-minute buckets aligned to fixed time boundaries rather than to user activity, so a single user session that straddles a boundary is reported as two or more aggregates rather than one logical session.
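A small plain-Python sketch shows the split. It assumes fixed windows aligned to multiples of the window size (offset zero); the helper `fixed_window_start` and the timestamps are hypothetical.

```python
WINDOW_SECONDS = 20 * 60


def fixed_window_start(ts, size=WINDOW_SECONDS):
    """Start of the fixed window that contains timestamp ts (offset 0)."""
    return ts - (ts % size)


# One continuous session: successive events are only 200 s apart.
session_events = [900, 1100, 1300, 1500]

counts = {}
for ts in session_events:
    start = fixed_window_start(ts)
    counts[start] = counts.get(start, 0) + 1
```

Although the four events form one logical session, `counts` holds two entries, one for the window starting at 0 and one for the window starting at 1200.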

Why C is wrong: Sliding windows overlap, so each event belongs to multiple windows. With a 20-minute window and a 1-minute period, every event falls into 20 windows and each user produces many overlapping aggregates. That is appropriate for moving averages but not for a one-aggregate-per-session contract.
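The overlap can be checked with a short plain-Python sketch. It assumes windows that start on multiples of the period and non-negative timestamps; the helper `sliding_window_starts` is hypothetical.

```python
SIZE = 20 * 60   # window length in seconds
PERIOD = 60      # a new window starts every 60 seconds


def sliding_window_starts(ts, size=SIZE, period=PERIOD):
    """Start times of every sliding window [start, start + size) containing ts."""
    last_start = ts - (ts % period)
    # Walk backwards one period at a time while the window still covers ts.
    return list(range(last_start, ts - size, -period))


starts = sliding_window_starts(5000)
window_count = len(starts)
```

A single event at timestamp 5000 lands in `window_count` windows, so any per-window aggregate counts it that many times.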

Why D is wrong: The global window groups all events into a single window per key and relies on triggers for emission, which cannot express the inactivity-gap semantics of a session and would mix events from unrelated sessions.

Other domains in this exam

See also the PDE cert hub, the study guide, and the cheat sheet.

Examworthy is not affiliated with or endorsed by Google Cloud. Original, blueprint-aligned practice material only.