A retail analytics team ingests clickstream events into Pub/Sub and processes them with a Dataflow streaming pipeline that aggregates page views per user session, where a session ends after 20 minutes of inactivity. Sessions can run from seconds to several hours, and late events arrive up to 10 minutes behind the watermark. Which windowing strategy in Apache Beam should the team apply to compute one aggregate per logical session per user?
- A. Apply session windows with a gap duration of 20 minutes keyed by user, with allowed lateness of 10 minutes. (Correct)
- B. Apply fixed windows of 20 minutes keyed by user, with allowed lateness of 10 minutes and accumulating panes.
- C. Apply sliding windows of 20 minutes with a 1-minute period, with allowed lateness of 10 minutes.
- D. Apply global windows with an early trigger every 20 minutes, with allowed lateness of 10 minutes.
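Why A is correct: Session windows are data-driven. For each key (here, the user), they group events as long as the gap between successive event timestamps is less than the configured gap duration of 20 minutes, producing exactly one window per logical session of activity regardless of how long the session lasts. The 10-minute allowed lateness lets late events still be merged into the session they belong to.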
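The gap rule can be illustrated with a minimal plain-Python sketch. This is not the Beam API; the function name `sessionize`, the tuple layout, and the sample events are assumptions made for illustration, and timestamps are event times in seconds.

```python
from collections import defaultdict

GAP_SECONDS = 20 * 60  # 20-minute inactivity gap from the question


def sessionize(events, gap=GAP_SECONDS):
    """Group (user, event_time_seconds) pairs into per-user sessions.

    A new session starts when the time since the user's previous event
    is at least `gap` seconds. Returns
    {user: [(first_event, last_event, event_count), ...]}.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()  # order by event time, not arrival order
        result = []
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev < gap:
                # Still inside the same session: extend it.
                prev = ts
                count += 1
            else:
                # Inactivity gap reached: close the session, open a new one.
                result.append((start, prev, count))
                start = prev = ts
                count = 1
        result.append((start, prev, count))
        sessions[user] = result
    return sessions


# Hypothetical events: u1 is active at 0s, 600s, 1500s, then again at 4000s.
events = [("u1", 0), ("u1", 600), ("u1", 1500), ("u1", 4000), ("u2", 100)]
print(sessionize(events))
```

In this sample, the first three `u1` events form one session and the event at 4000 seconds starts a second one, because 2500 seconds of inactivity exceeds the 1200-second gap. A real pipeline would rely on Beam's session windowing rather than sorting in memory.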
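Why B is wrong: Fixed windows split a long browsing session into arbitrary 20-minute buckets whose boundaries are fixed in event time and unrelated to user activity, so a single user session straddling a boundary is reported as two or more aggregates rather than one logical session. Accumulating panes only refine the result within each window; they do not merge windows.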
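A small plain-Python sketch shows the boundary problem. The helper `fixed_window` and the sample timestamps are assumptions for illustration; windows are assumed to be aligned to multiples of the window size in event-time seconds.

```python
WINDOW_SECONDS = 20 * 60  # 20-minute fixed windows


def fixed_window(ts, size=WINDOW_SECONDS):
    """Return the [start, end) fixed window that contains event time `ts`."""
    start = ts - ts % size
    return (start, start + size)


# One logical session: consecutive events are far less than 20 minutes apart.
session_events = [1000, 1100, 1250, 1300]

# Collect the distinct fixed windows these events fall into.
windows = sorted({fixed_window(ts) for ts in session_events})
print(windows)
```

The four events span only 300 seconds, yet they land in two different windows because the boundary at 1200 seconds falls in the middle of the session.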
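Why C is wrong: Sliding windows overlap, so each event belongs to multiple windows; with a 20-minute size and a 1-minute period, every event falls into 20 windows. This is appropriate for moving averages but not for a one-aggregate-per-session contract, because the same activity is reported many times and window boundaries still ignore session gaps.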
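The overlap can be counted with a short plain-Python sketch. The helper `sliding_windows` and the sample timestamp are assumptions for illustration, with window starts assumed to be aligned to multiples of the period.

```python
SIZE_SECONDS = 20 * 60  # 20-minute window size
PERIOD_SECONDS = 60     # a new window starts every minute


def sliding_windows(ts, size=SIZE_SECONDS, period=PERIOD_SECONDS):
    """Return every period-aligned [start, start + size) window containing `ts`."""
    # Latest window start that is at or before the event.
    last_start = ts - ts % period
    # Walk backwards one period at a time while the window still covers ts.
    starts = range(last_start, last_start - size, -period)
    return [(s, s + size) for s in starts]


assigned = sliding_windows(5000)
print(len(assigned))
```

A single event at 5000 seconds is assigned to every window whose start lies within the preceding 20 minutes, so one page view contributes to many overlapping aggregates.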
Why D is wrong: The global window groups all events into a single window per key and relies on triggers for emission, which cannot express the inactivity-gap semantics of a session and would mix events from unrelated sessions.