Objectives in this domain

Determine a strategy to improve operational excellence with Amazon CloudWatch monitoring, logging, alerting and automatic remediation of recurring failures.
Section 3.1 (medium)
Improve existing deployment processes by adopting blue/green and rolling strategies and configuration management automation with AWS Systems Manager.
Section 3.1 (hard)
Improve the security of an existing solution by hardening secrets management with AWS Secrets Manager, auditing for least privilege and enforcing automated compliance with AWS Config.
Section 3.2 (hard)
Design patch management, backup and automated vulnerability remediation for an existing environment using AWS Systems Manager, Amazon Inspector and AWS Backup.
Section 3.2 (medium)
Improve the performance of an existing solution using Amazon CloudFront, AWS Global Accelerator, caching and identification of bottlenecks against measurable SLAs and KPIs.
Section 3.3 (hard)
Improve reliability by remediating single points of failure, enabling data replication and self-healing, and resolving service quota and scaling limits.
Section 3.4 (hard)
Translate business requirements into measurable metrics and rightsize existing resources using AWS Compute Optimizer and CloudWatch to remove waste while protecting performance.
Section 3.3 (medium)
Identify cost optimisation opportunities in running workloads by analysing AWS Cost and Usage Reports, eliminating unused resources and setting billing alarms and tagging.
Section 3.5 (medium)

Sample question from this domain

Free sample: Continuous Improvement for Existing Solutions (medium)

A payments company runs a fleet of Amazon EC2 instances behind an Application Load Balancer. About once a week a memory leak causes individual instances to stop responding to health checks, and an on-call engineer currently logs in to reboot the affected instance, which takes around twenty minutes overnight. The team wants the recovery to happen automatically the moment an instance becomes unhealthy, with no standing servers added and no custom code to patch. Which approach MOST efficiently remediates the recurring failure?

A. Publish a custom memory metric from each instance and create an Amazon CloudWatch alarm that emails the on-call rota through Amazon SNS when the metric breaches the threshold so the engineer can respond sooner during the night.
B. Create an Amazon CloudWatch alarm on the StatusCheckFailed metric for each instance and configure the alarm's built-in EC2 reboot action so that the instance is rebooted automatically as soon as the alarm enters the ALARM state. (Correct)
C. Move the workload into an Amazon EC2 Auto Scaling group with a target tracking policy on average CPU utilisation so that capacity scales out and replaces the failing instances during the weekly memory event.
D. Schedule an AWS Lambda function with Amazon EventBridge to run every fifteen minutes, list the fleet, and reboot any instance whose health check has been failing, writing the action to a log group for audit.

Use a CloudWatch alarm with a native EC2 action to remediate an unhealthy instance automatically without custom code or added servers. Amazon CloudWatch alarms can invoke built-in EC2 actions such as reboot or recover when a status-check metric breaches its threshold, so remediation starts as soon as the alarm enters the ALARM state. This removes the overnight manual reboot without adding standing infrastructure or custom automation that the team would have to maintain. None of the alternatives achieves that: an alert-only alarm still needs a human, load-based scaling does not react to an unhealthy host, and a polling function adds code and delay.

Why A is wrong: Faster paging still depends on a human logging in to reboot the instance, so it shortens but does not remove the manual recovery the team wants to eliminate, and it does not act automatically.

Why B is correct: A CloudWatch alarm with a built-in EC2 action fires as soon as the status-check metric puts the alarm into the ALARM state, so the unhealthy instance is rebooted automatically with no added servers and no custom code for the team to maintain.

Why C is wrong: Target tracking on CPU scales for load, not for an unhealthy host, so a leaking instance that still consumes CPU would not be replaced, and this changes the architecture rather than directly remediating the fault.

Why D is wrong: A polling Lambda adds custom code the team must patch and can leave an instance unhealthy for up to fifteen minutes between runs, so it is slower and higher overhead than a native alarm action.
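
As a rough illustration of the native alarm action described for option B, the sketch below builds the arguments for a CloudWatch PutMetricAlarm call that reboots an instance through the built-in EC2 action ARN. The helper names, alarm name, period and evaluation count are illustrative assumptions, not values taken from the question, and the boto3 call is kept in a separate function so the parameters can be inspected without AWS credentials.

```python
def reboot_alarm_params(instance_id: str, region: str,
                        metric_name: str = "StatusCheckFailed") -> dict:
    """Build keyword arguments for CloudWatch PutMetricAlarm.

    Hypothetical helper: the alarm name, 60-second period and 3 evaluation
    periods are assumptions chosen for illustration. The narrower
    StatusCheckFailed_Instance metric could be passed as metric_name instead.
    """
    return {
        "AlarmName": f"auto-reboot-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                 # seconds per evaluation period
        "EvaluationPeriods": 3,       # consecutive breaching periods required
        "Threshold": 1.0,             # the metric reports 1 when a check fails
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action: no Lambda function or extra server involved.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:reboot"],
    }


def create_reboot_alarm(instance_id: str, region: str) -> None:
    """Create the alarm for one instance (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**reboot_alarm_params(instance_id, region))


if __name__ == "__main__":
    # Placeholder instance ID and region, printed for inspection only.
    for key, value in reboot_alarm_params("i-0123456789abcdef0", "eu-west-1").items():
        print(f"{key}: {value}")
```

One alarm is created per instance because the alarm's InstanceId dimension determines which instance the action targets. With these assumed settings the reboot would start after three consecutive one-minute periods in breach, rather than waiting for the next run of a scheduled function.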

Other domains in this exam

Design Solutions for Organizational Complexity: 26% of the exam
Design for New Solutions: 29% of the exam
Accelerate Workload Migration and Modernization: 20% of the exam

See also the SAP-C02 cert hub, the study guide, and the cheat sheet.