AWS instance monitoring for SSM status / real instance health with Lambda + EventBridge
In most enterprise use cases, compliance requirements mandate that all instances be managed through AWS Systems Manager. This approach offers numerous benefits, such as leveraging Fleet Manager to control instances, executing tasks with Run Command, using Patch Manager for updates, and, most importantly, accessing the instance shell through Session Manager.
Sounds great, right? However, there’s a fundamental problem when you start investigating how these functionalities are achieved. As Murphy’s Law states, “Anything that can go wrong, will go wrong.” Here’s my experience in production.
Sometimes, an instance becomes unresponsive even though its health check status remains “healthy.” This issue arises due to the way EC2 health checks work. The checks primarily assess basic network reachability and low-level system responsiveness. The kernel prioritizes critical low-level tasks, such as handling networking requests, over user-space processes like SSH or application services. Consequently, even if critical services fail or resource exhaustion (e.g., high CPU or RAM usage) occurs, the instance may still appear “healthy” because the underlying system and network stack remain partially functional.
In practical terms, this renders typical CloudWatch alerts like “StatusCheckFailed” ineffective. This is where AWS Systems Manager (SSM) comes into play. All SSM functionalities depend on the amazon-ssm-agent running on the instance. However, during resource exhaustion, non-critical services (from EC2’s perspective) are often terminated. If you’re lucky, a few of the critical services your application relies on may remain intact. For example, I’ve encountered instances where the HAProxy service continued running on a “zombie” instance while SSH and SSM were completely unresponsive. That instance dodged detection for about a day because it was still serving traffic, so we assumed it was working. It was only when AWS Config notified us that SSM was not enabled that we started to investigate.
Of course, you could configure CloudWatch alerts to send SNS notifications when CPU or RAM usage exceeds a certain threshold. But this approach isn’t foolproof. The instance might or might not have become a “zombie,” and waking the operations team in the middle of the night just to check isn’t ideal.
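For completeness, a threshold alarm of that kind is easy to stand up with boto3. This is only a sketch: the instance ID, threshold, and SNS topic ARN are placeholders, and remember that memory metrics require the CloudWatch agent on the instance (CPU is built in).
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Hypothetical example: notify the on-call SNS topic when average CPU stays
# above 90% for three consecutive 5-minute periods on one instance.
cloudwatch.put_metric_alarm(
    AlarmName='high-cpu-i-0123456789abcdef0',  # placeholder name
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],  # placeholder instance
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=90.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts']  # placeholder SNS topic
)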
After digging around, I realized that AWS Systems Manager’s Fleet Manager provides an intuitive indicator. The console shows whether the SSM agent’s status is “Online” or “Disconnected.” Since SSM is AWS’s primary tool for managing EC2 instances, this status serves as a reliable indicator of an instance’s real health. If the SSM agent is offline, it’s a clear sign the instance is experiencing issues.
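The same status is exposed through the SSM API, which is what the Lambda later in this post relies on. A minimal check looks roughly like this (region is a placeholder); each managed instance reports a PingStatus of Online, ConnectionLost, or Inactive, and an instance that has never registered with SSM simply won’t appear in the list.
import boto3

ssm = boto3.client('ssm', region_name='us-east-1')

# List every SSM-managed instance and its agent ping status
info = ssm.describe_instance_information()
for item in info['InstanceInformationList']:
    print(item['InstanceId'], item['PingStatus'])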
I assumed it would be straightforward to set up a rule that detects an SSM status change and triggers notifications. Unfortunately, I was wrong. After reaching out to AWS Support, I learned that, as of January 2025, AWS does not provide a simple way to create such rules directly through the console.
The only practical solution is to create a custom workflow using another service, such as AWS Lambda, to query the SSM agent’s status via the AWS SDK (e.g., Boto3 for Python). You can then use Amazon EventBridge to schedule a periodic cron job, say every 15 minutes, to invoke the Lambda function. While this introduces a time lag between checks, it gets the job done, if in a somewhat roundabout way…
Setup
Lambda
I’ll implement simple logic: check whether SSM is online, and if not, reboot or terminate the instance. You can, however, add your own custom logic to suit your workflow. For instance, you could use paramiko (Python) to first check whether the instance still responds over SSH before rebooting (a rough sketch of that idea follows below).
Note that standalone instances can simply be rebooted. Instances in Auto Scaling Groups (a plain ASG, or nodes of ECS/EKS, etc.) need to be terminated instead; the ASG/ECS/EKS will automatically spin up a replacement.
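As an illustration of that SSH pre-check idea, here is a minimal sketch. It is not part of the Lambda below; the hostname, user, and key path are placeholders, and paramiko is not in the Lambda runtime by default, so you would have to package it as a layer.
import paramiko

def ssh_is_responsive(host, username='ec2-user', key_path='/tmp/key.pem', timeout=10):
    """Return True if we can open an SSH session and run a trivial command."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username=username, key_filename=key_path, timeout=timeout)
        stdin, stdout, stderr = client.exec_command('true', timeout=timeout)
        return stdout.channel.recv_exit_status() == 0
    except Exception:
        return False
    finally:
        client.close()
You could call this before deciding to reboot: if SSH still answers, alert instead of remediating.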
import boto3
import json
import os
import logging
import time

# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

REGION = os.environ.get('REGION')


def handle_instance_health(event):
    """
    Monitor EC2 instance health and remediate issues based on SSM status
    """
    ec2_resource = boto3.resource('ec2', region_name=REGION)
    ec2_client = boto3.client('ec2', region_name=REGION)
    ssm_client = boto3.client('ssm', region_name=REGION)
    autoscaling_client = boto3.client('autoscaling', region_name=REGION)
    sns_client = boto3.client('sns', region_name=REGION)

    # Configuration
    SNS_TOPIC_ARN = os.environ.get('SNS_TOPIC_ARN')
    EXCLUDED_INSTANCES_STR = os.environ.get('EXCLUDED_INSTANCES', '')
    excluded_instances = [inst.strip() for inst in EXCLUDED_INSTANCES_STR.split(',') if inst.strip()]

    try:
        # Get all running instances except excluded ones
        running_instances = {
            instance.id: instance
            for instance in ec2_resource.instances.all()
            if instance.state['Name'] == 'running' and instance.id not in excluded_instances
        }
        if not running_instances:
            return "No running instances to monitor"

        # Get SSM agent status for all managed instances
        ssm_response = ssm_client.describe_instance_information()
        time.sleep(5)  # pause briefly between API calls

        # Get instance status details
        instance_statuses = ec2_client.describe_instance_status(
            InstanceIds=list(running_instances.keys()),
            IncludeAllInstances=True
        )
        time.sleep(5)

        # Identify managed instances in Auto Scaling Groups
        asg_instances = set()
        try:
            asg_response = autoscaling_client.describe_auto_scaling_instances()
            time.sleep(5)
            asg_instances = {inst['InstanceId'] for inst in asg_response.get('AutoScalingInstances', [])}
        except Exception as e:
            print(f"ASG Lookup Error: {e}")

        for status in instance_statuses['InstanceStatuses']:
            instance_id = status['InstanceId']
            instance = running_instances.get(instance_id)

            # Check SSM status
            ssm_info_list = ssm_response.get("InstanceInformationList", [])
            ssm_ping_status = "Missing"
            for instance_info in ssm_info_list:
                if instance_info["InstanceId"] == instance_id:
                    ssm_ping_status = instance_info["PingStatus"]
                    break

            # Get instance name from tags
            instance_name = "N/A"
            if instance and instance.tags:
                name_tags = [tag['Value'] for tag in instance.tags if tag['Key'] == 'Name']
                if name_tags:
                    instance_name = name_tags[0]

            # If SSM status is not online, take remediation action
            if ssm_ping_status != 'Online':
                message = (
                    f"Instance Health Check Issue:\n"
                    f"Instance ID: {instance_id}\n"
                    f"Instance Name: {instance_name}\n"
                    f"SSM Ping Status: {ssm_ping_status}\n"
                )
                try:
                    # Remediation for ASG instances
                    if instance_id in asg_instances:
                        # Terminate instance, ASG will replace it
                        asg_details = autoscaling_client.describe_auto_scaling_instances(
                            InstanceIds=[instance_id]
                        )
                        asg_name = asg_details['AutoScalingInstances'][0]['AutoScalingGroupName']
                        autoscaling_client.terminate_instance_in_auto_scaling_group(
                            InstanceId=instance_id,
                            ShouldDecrementDesiredCapacity=False
                        )
                        message += f"Action: Terminated in ASG {asg_name}"
                    # Fallback for standalone instances
                    else:
                        # Reboot the instance
                        instance.reboot()
                        message += "Action: Rebooted standalone instance"

                    # Send notification about remediation
                    sns_client.publish(
                        TopicArn=SNS_TOPIC_ARN,
                        Subject=f"Instance Health Remediation - {instance_id}",
                        Message=message
                    )
                except Exception as remediation_error:
                    # Notification about remediation failure
                    error_message = message + f"Remediation Failed: {str(remediation_error)}"
                    sns_client.publish(
                        TopicArn=SNS_TOPIC_ARN,
                        Subject=f"Remediation Failed - {instance_id}",
                        Message=error_message
                    )

        return "Processing complete"

    except Exception as e:
        error_message = f"Error in instance health monitoring: {str(e)}"
        print(error_message)
        # Send error notification
        sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Subject="Critical Error in Instance Health Monitoring",
            Message=error_message
        )
        raise


def lambda_handler(event, context):
    """
    Lambda entry point. Calls the main function and returns response.
    """
    try:
        result = handle_instance_health(event)
        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': result
            })
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': str(e)
            })
        }
Note: I assume the Lambda execution role, SNS topic, and the EC2/SSM/Auto Scaling permissions are all set up correctly. The point of this post is SSM.
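For reference, the execution role needs roughly the permissions below. This is a hedged sketch of attaching an inline policy with boto3: the role and policy names are placeholders, and you may want to scope the resources more tightly than “*”. CloudWatch Logs access usually comes separately from the AWSLambdaBasicExecutionRole managed policy.
import boto3
import json

iam = boto3.client('iam')

# Minimal actions used by the Lambda above
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceStatus",
                "ec2:RebootInstances",
                "ssm:DescribeInstanceInformation",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "sns:Publish"
            ],
            "Resource": "*"  # tighten to specific ARNs where possible
        }
    ]
}

iam.put_role_policy(
    RoleName='instance-health-lambda-role',   # placeholder role name
    PolicyName='instance-health-monitoring',  # placeholder policy name
    PolicyDocument=json.dumps(policy)
)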
EventBridge
Create a scheduled cron job in EventBridge

You will be prompted to give the schedule a name, description, etc. Then set up the rule. You want to make sure it’s recurring. For me, I set it to every 15 minutes, but this is up to your customisation.

As for the rule target, choose the Lambda we created earlier. You can leave the payload blank, since our Lambda does not require one.

As for optional settings, I’ll just leave the Action after schedule completion as None. And since this is a recurring script, I will also turn off the Retry policy. I’ll leave the rest as default and save it.

You can customise the rule to your liking. Once done, just select the Create Schedule confirmation and the rule will be up.
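If you’d rather script this than click through the console, roughly the same schedule can be created with the EventBridge Scheduler API. A sketch follows; the ARNs and names are placeholders, and the role you pass must trust scheduler.amazonaws.com and allow lambda:InvokeFunction on the target function.
import boto3

scheduler = boto3.client('scheduler', region_name='us-east-1')

# Recurring schedule that invokes the health-check Lambda every 15 minutes
scheduler.create_schedule(
    Name='ssm-health-check-every-15-min',  # placeholder schedule name
    ScheduleExpression='rate(15 minutes)',
    FlexibleTimeWindow={'Mode': 'OFF'},
    Target={
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:instance-health-monitor',  # placeholder Lambda ARN
        'RoleArn': 'arn:aws:iam::123456789012:role/scheduler-invoke-lambda'               # placeholder role ARN
    },
    State='ENABLED'
)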
Finally, I just want to rant a bit. This seems like such an easy and intuitive feature for the AWS SSM team to include by default. There is already a page in SSM Fleet Manager that shows the SSM status; why not just expose that status to EventBridge? It takes a lot of time to investigate, come up with the Lambda script, test it, and then wire up EventBridge. It’s also a waste of customers’ money (invoking Lambda periodically) when it could be far more targeted, with an SSM disconnect event triggering EventBridge directly. I know it’s a small thing, but how many millions of EC2 instances are managed by SSM? It seems like a valid concern to me.
P.S. As always, I don’t know it all. If there is indeed an easier way to tackle this problem within AWS, I’d be happy to learn more.