Reboot AWS instance on the failed status check
I know, I know, this is not the best practice. The instance should be configured so that it works properly and its status checks always pass. The problem is that we test Docker images on some of our test instances, and from time to time they cause the instance to hang.
A cron job on the instance itself was not enough: once the server was inoperative, the cron jobs were not executed at all. As a solution, I implemented a Lambda function that checks the instance status and reboots the instance if needed.
Step one – IAM configuration
For Lambda to have access to our EC2 instances and to be able to reboot them, it needs an IAM Role. For the role to grant these abilities, it needs an IAM Policy. So, let’s start with the IAM Policy. Open the IAM Console and go to Policies. Once there, click the Create Policy button.
Because our policy is rather simple, we can switch to JSON view and paste the policy contents:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RebootDescribePolicy",
            "Effect": "Allow",
            "Action": [
                "ec2:RebootInstances",
                "ec2:DescribeInstanceStatus"
            ],
            "Resource": "*"
        }
    ]
}
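The policy above allows rebooting any instance in the account. If you want something tighter, the following is a minimal sketch of how a narrower policy document could be generated. The function name build_policy and the account ID 123456789012 are placeholders I made up for illustration. It assumes that ec2:RebootInstances can be limited to a single instance ARN, while ec2:DescribeInstanceStatus has to stay on "*" because the EC2 Describe actions do not support resource-level permissions.

```python
import json


def build_policy(region, account_id, instance_id):
    """Return a policy document that allows rebooting only one instance."""
    instance_arn = f"arn:aws:ec2:{region}:{account_id}:instance/{instance_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Describe actions cannot be scoped to a single resource.
                "Sid": "DescribeStatus",
                "Effect": "Allow",
                "Action": "ec2:DescribeInstanceStatus",
                "Resource": "*",
            },
            {
                # Rebooting is limited to the one instance we monitor.
                "Sid": "RebootOneInstance",
                "Effect": "Allow",
                "Action": "ec2:RebootInstances",
                "Resource": instance_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values; replace them with your own region, account and instance.
    policy = build_policy("us-east-1", "123456789012", "i-1234567890")
    print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the same JSON view of the Create Policy page.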
Click Review Policy and as the next step, set the name for the policy. Once the name is set, click Create Policy at the bottom of the page and you are ready to create the Role.
Switch the IAM console to Roles and click Create Role. When asked to choose the use case, click Lambda.
Click Next: Permissions and select the policy you created in the previous step from the list of policies. You can now click Next: Tags and set tags if you want. Once the tags are configured, click Next: Review, set the role name, and click Create Role.
With the above steps, we have prepared everything the Lambda function needs in terms of security permissions.
Step two – the Lambda function itself
Now you can switch to the Lambda console. Switch to Functions and click the Create Function button. You will have to fill the form with the function name, set the Runtime to Python 3.8, and last but not least, change the default execution role to the one you created in the previous step.
Now, the function itself. In the Function Code window, you can paste the following code:
import json
import boto3

#
# region_name - the region that should be covered by the scheduler
#
region_name = 'us-east-1'

#
# instance_to_check - set the ID of the instance to check
#
instance_to_check = {
    'instance_id': 'i-1234567890'
}


def lambda_handler(event, context):
    # a single client is enough for both the status query and the reboot
    ec2_client = boto3.client('ec2', region_name=region_name)
    instance_ids = [instance_to_check['instance_id']]

    response = ec2_client.describe_instance_status(InstanceIds=instance_ids)
    for status in response['InstanceStatuses']:
        in_status = status['InstanceStatus']['Details'][0]['Status']
        sys_status = status['SystemStatus']['Details'][0]['Status']

        # check statuses: anything other than 'passed' triggers a reboot
        if in_status != 'passed' or sys_status != 'passed':
            print('Reboot required')
            ec2_client.reboot_instances(InstanceIds=instance_ids, DryRun=False)

    return {
        'statusCode': 200,
        'body': json.dumps('Done')
    }
Please remember to set the proper region and the proper instance ID in the configuration section at the top of the code. As the next step, click Deploy at the top of the code editor. The code is now ready to work. You can test it by using the Test button at the top of the page.
For the test, it does not matter what data is sent to the function, because it works without any external data: all required configuration details are in the code. Again, this is not the best practice. It is better to configure the region and instance ID as environment variables, but I want to keep this example as simple as possible.
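For those who prefer the environment variable approach, here is a minimal sketch of how it could look. The variable names REGION_NAME and INSTANCE_ID and the helper load_config are my own assumptions, not anything required by Lambda; you would set the variables in the function's configuration. The check itself is the same as in the function above.

```python
import json
import os


def load_config(env=None):
    """Read the region and instance ID from environment variables."""
    env = os.environ if env is None else env
    return {
        # hypothetical variable names; the region falls back to us-east-1
        "region_name": env.get("REGION_NAME", "us-east-1"),
        # the instance ID is required, so a missing value raises KeyError
        "instance_id": env["INSTANCE_ID"],
    }


def lambda_handler(event, context):
    # imported here so that load_config can be tried out without boto3
    import boto3

    config = load_config()
    instance_ids = [config["instance_id"]]
    ec2_client = boto3.client("ec2", region_name=config["region_name"])

    response = ec2_client.describe_instance_status(InstanceIds=instance_ids)
    for status in response["InstanceStatuses"]:
        in_status = status["InstanceStatus"]["Details"][0]["Status"]
        sys_status = status["SystemStatus"]["Details"][0]["Status"]
        if in_status != "passed" or sys_status != "passed":
            print("Reboot required")
            ec2_client.reboot_instances(InstanceIds=instance_ids)

    return {"statusCode": 200, "body": json.dumps("Done")}
```

With this variant, the same code can be deployed for several instances, and only the variables differ.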
Step three – scheduler
I wanted this function to be executed at regular intervals. In order to achieve this, go to the top of the page, to the Designer part, and click the Add Trigger button.
Select the EventBridge event type and fill in the form: select Create a new rule in the Rule box, set the rule name, and set the rate (in my example, the function is executed every 30 minutes).
Once the form is filled, click the Add button and your Lambda is ready. It will be executed in the intervals selected in the trigger.
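The same trigger can also be created from code instead of the console. The following is only a sketch under a few assumptions: the helper names rate_expression and schedule_function, the target ID reboot-check, and the statement ID suffix are all made up for illustration, and the two clients are expected to be created by you with boto3.client('events') and boto3.client('lambda'). One detail that is easy to miss is that the rate expression uses a singular unit for a value of 1 and a plural unit otherwise.

```python
def rate_expression(value, unit="minute"):
    """Build an EventBridge rate expression such as 'rate(30 minutes)'."""
    if value < 1:
        raise ValueError("rate value must be a positive integer")
    # EventBridge expects 'rate(1 minute)' but 'rate(30 minutes)'.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def schedule_function(events_client, lambda_client, rule_name, function_arn, minutes):
    """Create a scheduled rule and attach the Lambda function as its target."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=rate_expression(minutes),
        State="ENABLED",
    )
    # Allow EventBridge to invoke the function, but only from this rule.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "reboot-check", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

Note that the role used to run this script needs its own permissions for EventBridge and Lambda; the policy from step one does not cover them.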
I suggest not running it too frequently. Sometimes it takes a minute or two for the instance to start. This means that if the Lambda runs every minute, it will be executed before the instance has finished starting. Since the status checks of such an instance are still in the “initializing” state, which the function treats as not passed, it will cause an immediate reboot. An interval of 15 minutes, or an even longer one for not-that-crucial servers, is good enough.
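Another way to reduce the risk of such a reboot loop is to reboot only when a check has really failed. This is a sketch, not what the function above does: the helper name needs_reboot is made up, and it assumes the summary Status field that describe_instance_status returns for each check, where “impaired” means a failed check and “initializing” or “insufficient-data” mean the checks have not finished yet.

```python
def needs_reboot(status):
    """Return True only when a status check has actually failed."""
    summaries = (
        status["InstanceStatus"]["Status"],
        status["SystemStatus"]["Status"],
    )
    # 'ok' and 'initializing' are left alone; only 'impaired' triggers a reboot.
    return "impaired" in summaries


if __name__ == "__main__":
    # A freshly rebooted instance whose checks are still running.
    booting = {
        "InstanceStatus": {"Status": "initializing"},
        "SystemStatus": {"Status": "ok"},
    }
    print(needs_reboot(booting))
```

In the handler, the condition on in_status and sys_status would then be replaced with a call to needs_reboot(status). The trade-off is that an instance stuck in the initializing state forever would never be rebooted by this check.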