Troubleshoot memory issue

Enable error mode for the PayForAdoption service

Update the Systems Manager parameter to turn on error mode in the PayForAdoption service.

aws ssm put-parameter --name '/petstore/errormode1' --value 'true' --overwrite
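
If you want to confirm the change before moving on, you can read the parameter back. This is an optional check and not part of the original steps:

aws ssm get-parameter --name '/petstore/errormode1' --query 'Parameter.Value' --output text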

Increase the traffic generator count for additional load

We are going to launch 5 instances of the trafficgenerator container to simulate increased load.

PETLISTADOPTIONS_CLUSTER=$(aws ecs list-clusters | jq '.clusterArns[]|select(contains("PetList"))' -r)
TRAFFICGENERATOR_SERVICE=$(aws ecs list-services --cluster $PETLISTADOPTIONS_CLUSTER | jq '.serviceArns[]|select(contains("trafficgenerator"))' -r)
aws ecs update-service --cluster $PETLISTADOPTIONS_CLUSTER --service $TRAFFICGENERATOR_SERVICE --desired-count 5
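
Optionally, you can verify the scale-out from the CLI. This quick check reuses the variables set above and simply prints the desired and running task counts:

aws ecs describe-services --cluster $PETLISTADOPTIONS_CLUSTER --services $TRAFFICGENERATOR_SERVICE | jq '.services[] | {desiredCount, runningCount}'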

Let’s find out what happens

Now that the traffic load has increased, wait for 10 minutes and go to the ServiceLens console. You should see the PayForAdoption service node in red, indicating that the service is in trouble and returning HTTP 500 errors.

Issue with PayForAdoptions
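
While you wait, you can also check the fault statistics that turn the node red by querying the X-Ray service graph from the terminal. This is only an optional alternative to the console view, and the payforadoption name match is an assumption about how the service appears on your map, so adjust it if needed:

EPOCH=$(date +%s)
aws xray get-service-graph --start-time $(($EPOCH - 600)) --end-time $EPOCH | jq '.Services[] | select(.Name | test("payforadoption"; "i")) | {Name, Faults: .SummaryStatistics.FaultStatistics.TotalCount}'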

When you click on the node, you will see that the service is returning a large number of HTTP 500 errors.

Trace metrics

Click on View in Container Insights and select the PayForAdoption service. You will see the metric widget showing the memory fluctuation in the service, which is contributing to the issue.

Container Insights Mem Issue
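
If you prefer the CLI, a rough equivalent of that widget is to pull the MemoryUtilized metric from the ECS/ContainerInsights namespace. The cluster and service names below are placeholders rather than values from this workshop, so substitute the ones shown in Container Insights (the date syntax assumes a Linux shell such as Cloud9 or CloudShell):

CLUSTER_NAME=<your-cluster-name>
SERVICE_NAME=<your-payforadoption-service-name>
aws cloudwatch get-metric-statistics --namespace ECS/ContainerInsights --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=$CLUSTER_NAME Name=ServiceName,Value=$SERVICE_NAME \
  --start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ) --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 --statistics Average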

Now click the View traces button to see traces from that service. On the next screen, select Node status from the Filter type list, tick the checkbox that says Faults, and click the Update node status filter button.

Filter traces

Scroll all the way down and click on one of the traces. The traces listed are the ones that contain an HTTP 500 error.

Trace list
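
The same fault filter can also be approximated from the terminal with an X-Ray filter expression. The service name in the expression is an assumption about how the service is named on the map, so adjust it to match what you see in the console:

EPOCH=$(date +%s)
aws xray get-trace-summaries --start-time $(($EPOCH - 600)) --end-time $EPOCH --filter-expression 'fault = true AND service("payforadoption")' | jq '.TraceSummaries[].Id'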

On the next screen, you can see the segments timeline showing the 500 response code from the payforadoption service, as shown below.

Look at logs

CloudWatch ServiceLens automatically correlates traces, logs, and metrics to help identify the root cause. Scrolling all the way down to the Logs section reveals the actual error from the application, as shown below. You can also dive deeper into the log data and analyze it by clicking the View in CloudWatch Logs Insights button.

Look at logs
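
If you want to run a similar analysis without leaving the terminal, you can start a Logs Insights query from the CLI. The log group name below is a placeholder (the actual group depends on your task definition), so replace it with the one shown in the Logs section:

LOG_GROUP=<your-payforadoption-log-group>
QUERY_ID=$(aws logs start-query --log-group-name $LOG_GROUP --start-time $(($(date +%s) - 1800)) --end-time $(date +%s) --query-string 'fields @timestamp, @message | filter @message like /error/ | sort @timestamp desc | limit 20' --query 'queryId' --output text)
aws logs get-query-results --query-id $QUERY_ID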

The following GIF shows the sequence of actions described above.

Look at logs

Revert the application to normal behavior

Set the traffic generator count to 1 for regular load

PETLISTADOPTIONS_CLUSTER=$(aws ecs list-clusters | jq '.clusterArns[]|select(contains("PetList"))' -r)
TRAFFICGENERATOR_SERVICE=$(aws ecs list-services --cluster $PETLISTADOPTIONS_CLUSTER | jq '.serviceArns[]|select(contains("trafficgenerator"))' -r)
aws ecs update-service --cluster $PETLISTADOPTIONS_CLUSTER --service $TRAFFICGENERATOR_SERVICE --desired-count 1

Disable error mode to return to normal behavior

Update the Systems Manager parameter to turn off error mode in the PayForAdoption service.

aws ssm put-parameter --name '/petstore/errormode1' --value 'false' --overwrite

Wait a few minutes and check the CloudWatch Container Insights screen to observe that the highly fluctuating memory utilization has stopped.

Container Insights normal behavior