A payments company runs a fleet of Amazon EC2 instances behind an Application Load Balancer. About once a week a memory leak causes individual instances to stop responding to health checks, and an on-call engineer currently logs in to reboot the affected instance, which takes around twenty minutes overnight. The team wants the recovery to happen automatically the moment an instance becomes unhealthy, with no standing servers added and no custom code to patch. Which approach MOST efficiently remediates the recurring failure?
- A. Publish a custom memory metric from each instance and create an Amazon CloudWatch alarm that emails the on-call rota through Amazon SNS when the metric breaches the threshold, so the engineer can respond sooner during the night.
- B. Create an Amazon CloudWatch alarm on the StatusCheckFailed metric for each instance and configure the alarm's built-in EC2 reboot action to restart the instance automatically as soon as the alarm enters the ALARM state. (Correct)
- C. Move the workload into an Amazon EC2 Auto Scaling group with a target tracking policy on average CPU utilisation so that capacity scales out and replaces the failing instances during the weekly memory event.
- D. Schedule an AWS Lambda function with Amazon EventBridge to run every fifteen minutes, list the fleet, and reboot any instance whose health check has been failing, writing the action to a log group for audit.
Why A is wrong: Faster paging still depends on a human logging in to reboot the instance, so it shortens the manual recovery the team wants to eliminate but does not remove it, and nothing acts automatically.
Why B is correct: A CloudWatch alarm with a built-in EC2 reboot action triggers the recovery as soon as the failed status check moves the alarm into the ALARM state, so the unhealthy instance is rebooted automatically with no added servers and no custom code for the team to maintain.
Why C is wrong: Target tracking on CPU scales for load, not for an unhealthy host, so a leaking instance that still consumes CPU would not be replaced, and this changes the architecture rather than directly remediating the fault.
Why D is wrong: A polling Lambda adds custom code the team must patch and can leave an instance unhealthy for up to fifteen minutes between runs, so it is slower and higher overhead than a native alarm action.
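To make option B concrete, the alarm it describes can be sketched as the set of parameters passed to CloudWatch's `PutMetricAlarm` operation. This is a minimal illustration, not the team's actual configuration: the helper name `build_reboot_alarm`, the alarm naming scheme, the instance ID, the region, and the 60-second period with two evaluation periods are all assumptions chosen for the example.

```python
from typing import Any, Dict


def build_reboot_alarm(instance_id: str, region: str) -> Dict[str, Any]:
    """Return keyword arguments for CloudWatch PutMetricAlarm.

    Hypothetical helper: the alarm name and the evaluation settings
    are illustrative assumptions, not values taken from the question.
    """
    return {
        # Assumed naming scheme, one alarm per instance.
        "AlarmName": f"reboot-on-status-check-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        # The metric reports 0 (passed) or 1 (failed), so Maximum >= 1
        # means at least one failed check in the period.
        "Statistic": "Maximum",
        "Period": 60,               # seconds (assumed)
        "EvaluationPeriods": 2,     # assumed, to avoid reacting to a single blip
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 reboot action: no Lambda or other custom code runs.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:reboot"],
    }


if __name__ == "__main__":
    # Placeholder instance ID and region for demonstration only.
    params = build_reboot_alarm("i-0123456789abcdef0", "eu-west-1")
    print(params["AlarmActions"][0])
```

With boto3 installed and credentials configured, these parameters could be applied with `boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)`. Keeping the parameter construction separate from the API call lets the configuration be reviewed or tested without touching an AWS account.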