Objectives in this domain

Plan data pipelines by defining sources and sinks, transformation and orchestration logic, networking fundamentals, and data encryption requirements.
Section 2.1 (medium)
Build data pipelines using Dataflow, Apache Beam, Dataproc, Cloud Data Fusion, BigQuery, Pub/Sub, Kafka, Spark, and Hadoop, including batch and streaming transformations with windowing and late-arriving data handling.
Section 2.2 (hard)
Apply AI and ML for data enrichment during ingestion and processing, and integrate new data sources.
Section 2.3 (medium)
Deploy and operationalise pipelines using Cloud Composer and Workflows for job automation, and implement CI/CD for data pipelines.
Section 2.4 (medium)

Sample question from this domain

Free sample: Ingesting and Processing Data (hard)

A retail analytics team ingests clickstream events into Pub/Sub and processes them with a Dataflow streaming pipeline that aggregates page views per user session, where a session ends after 20 minutes of inactivity. Sessions can run from seconds to several hours, and late events arrive up to 10 minutes behind the watermark. Which windowing strategy in Apache Beam should the team apply to compute one aggregate per logical session per user?

A. Apply session windows with a gap duration of 20 minutes keyed by user, with allowed lateness of 10 minutes. (Correct)
B. Apply fixed windows of 20 minutes keyed by user, with allowed lateness of 10 minutes and accumulating panes.
C. Apply sliding windows of 20 minutes with a 1-minute period, with allowed lateness of 10 minutes.
D. Apply global windows with an early trigger every 20 minutes, with allowed lateness of 10 minutes.

Choose Beam session windows when boundaries are defined by gaps in user activity rather than by fixed clock intervals. Session windows in Apache Beam are created dynamically per key from the inactivity gap between event timestamps. When a new event arrives within the gap of an existing window for that key, the window is extended; otherwise a new session is opened. This matches the requirement of one aggregate per logical user session of any duration, and the 10-minute allowed lateness keeps window state alive long enough to absorb late events.
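The merging behaviour described above can be sketched in plain Python, without the Beam SDK. This is an illustrative model under stated assumptions, not Beam's implementation: the function name `sessionize`, the tuple layout, and the sample events are all hypothetical. In the Beam Python SDK the same windowing would typically be configured with something like `beam.WindowInto(window.Sessions(20 * 60), allowed_lateness=10 * 60)`.

```python
from collections import defaultdict

GAP_SECONDS = 20 * 60  # inactivity gap from the question (20 minutes)


def sessionize(events, gap=GAP_SECONDS):
    """Group (user, event_time_seconds) pairs into per-user sessions.

    Each event starts as a proto-window [ts, ts + gap). Overlapping
    proto-windows for the same user are merged into one session.
    Returns {user: [(start, end, event_count), ...]}.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        merged = []
        # Sorting by event time means arrival order does not matter here.
        for ts in sorted(stamps):
            if merged and ts < merged[-1][1]:
                # Event falls inside the open session: extend its end.
                start, _, count = merged[-1]
                merged[-1] = (start, ts + gap, count + 1)
            else:
                # Gap exceeded (or first event): open a new session.
                merged.append((ts, ts + gap, 1))
        sessions[user] = merged
    return sessions


# Hypothetical sample data: event times in seconds.
events = [("u1", 0), ("u1", 600), ("u1", 1500), ("u1", 3500), ("u2", 100)]
on_time = sessionize(events)

# One more event for u1 at 2500 s, e.g. one that arrived late.
with_late = sessionize(events + [("u1", 2500)])
```

In this sample, `on_time["u1"]` holds two sessions because the 2000-second gap between 1500 s and 3500 s exceeds the 1200-second gap duration. Adding the event at 2500 s bridges that gap, so `with_late["u1"]` collapses into a single session, which is the kind of merge that allowed lateness makes possible for late data.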

Why A is correct: Session windows are data-driven and group events for the same key whenever the gap between successive event timestamps is below the configured duration, producing exactly one window per logical session of activity.

Why B is wrong: Fixed windows split a long browsing session into arbitrary 20-minute buckets aligned to fixed event-time boundaries, so a single user session that straddles a boundary is reported as two or more aggregates rather than one logical session.

Why C is wrong: Sliding windows overlap, so each event belongs to multiple windows (20 of them with a 20-minute window and a 1-minute period) and many aggregates are emitted per user. That suits moving averages but not a one-aggregate-per-session contract.

Why D is wrong: The global window groups all events into a single window per key and relies on triggers for emission, which cannot express the inactivity-gap semantics of a session and would mix events from unrelated sessions.
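To make the objections to options B and C concrete, the following sketch assigns a few event timestamps to fixed and sliding windows using plain arithmetic. It is a simplified model, assuming integer timestamps in seconds and windows aligned to time zero; the helper names `fixed_window` and `sliding_windows` and the sample timestamps are hypothetical and are not Beam APIs.

```python
def fixed_window(ts, size):
    """Return the [start, end) fixed window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Return every [start, end) sliding window that contains ts.

    Windows start at multiples of `period` and last `size` seconds.
    """
    last_start = ts - ts % period
    first_start = last_start - size + period
    return [(s, s + size) for s in range(first_start, last_start + 1, period)]


SIZE = 20 * 60   # 20-minute windows, as in options B and C
PERIOD = 60      # 1-minute period, as in option C

# One logical session: consecutive events are 10 minutes apart,
# well inside the 20-minute inactivity gap.
session_events = [3000, 3600, 4200, 4800]

# Option B: the single session is spread over several fixed buckets.
fixed_buckets = sorted({fixed_window(ts, SIZE) for ts in session_events})

# Option C: a single event lands in size / period overlapping windows.
windows_for_first_event = sliding_windows(session_events[0], SIZE, PERIOD)

print(len(fixed_buckets), "fixed windows for one session")
print(len(windows_for_first_event), "sliding windows for one event")
```

Neither assignment depends on the gaps between a user's events, which is why only session windows yield one aggregate per logical session.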

Other domains in this exam

Designing Data Processing Systems: 22% of the exam
Storing Data: 20% of the exam
Preparing and Using Data for Analysis: 15% of the exam
Maintaining and Automating Data Workloads: 18% of the exam

See also the PDE cert hub, the study guide, and the cheat sheet.