Tracking Failed ECS Tasks


Prologue

Many a time we have ECS tasks that fail, either while launching or later during their run. In production environments, it is crucial to keep track of failing tasks. If we can identify such tasks, we can perform Root Cause Analysis (RCA) and address the issues behind the failures, making production more resilient.

Tasks running as part of an ECS service are logged in the service events, but those events carry little detail, and in some cases the task stopped reason is not included at all. The service event log is also scoped to a single service and holds at most 100 events, so details of older failed tasks may no longer be available. Keeping track of failed tasks is therefore challenging, since ECS does not provide a console view dedicated to failed tasks, especially in a production environment with a large number of tasks, which makes RCA and remediation trickier still. In this post, we are going to work around this limitation.

Fortunately, Amazon EventBridge can help address this. A detailed treatment can be found in the official Amazon EventBridge documentation, but succinctly put: AWS services send events (with a payload) to EventBridge event buses, EventBridge evaluates the event payloads against rules, and in turn triggers target AWS services, forwarding the payload to them. As the name suggests, it acts as a bridge between AWS services, receiving events and triggering targets.

Before we move further, if you are new to EventBridge, I highly recommend getting familiar with it using the Getting Started guide in the official AWS documentation.

Event Rules:

An event rule matches incoming events against an event pattern and triggers its targets only when the pattern matches. ECS creates events in three cases: container instance state change events, task state change events, and service action events. This post focuses primarily on task state change events, since task failures surface through them.

A task state change event is created when a task is started or stopped via an ECS service or independently using the RunTask, StartTask, or StopTask API calls; when a task fails to launch; or when a container in the task changes state. In each case the event carries a distinct set of key-value pairs that can be used in an event pattern to identify problematic tasks, with fine-grained control over tasks belonging to a specific ECS cluster and/or an ECS service within that cluster.

Event Patterns:

From this section onwards, we will go through examples of different event patterns that capture task events for specific issues.

Here is a sample ECS task event where a container exited with exit code 127:

{
  "version":"0",
  "id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "detail-type":"ECS Task State Change",
  "source":"aws.ecs",
  "account":"xxxxxxxxxxxx",
  "time":"2021-05-08T11:58:12Z",
  "region":"us-east-1",
  "resources":["arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/default/xxxxxxxxxxxxxxxxxxxxxxxx"],
  "detail":{
    "clusterArn":"arn:aws:ecs:us-east-1:xxxxxxxxxxxx:cluster/default",
    "connectivity":"CONNECTED",
    "connectivityAt":"2021-05-08T11:57:37.852Z",
    "containers":[{
      "containerArn":"arn:aws:ecs:us-east-1:xxxxxxxxxxxx:container/default/xxxxxxxxxxxxxxxxxxxxxxxx/xxxxxx-xx-xx-xxxxxx",
      "exitCode":127,
      "lastStatus":"STOPPED",
      "name":"Nginx-Container",
      "image":"nginx",
      "runtimeId":"xxxxxxxxxxxxxxxxxxxxxxxx-2065566309",
      "taskArn":"arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/default/xxxxxxxxxxxxxxxxxxxxxxxx",
      "networkInterfaces":
        [{
          "attachmentId":"xxxxxx-xx-xx-xx-xxxxxx-xxxxxxxxx",
          "privateIpv4Address":"172.31.78.62",
          "ipv6Address":"xxxx:xxxx:xxx:fd00:84d4:xxx:xx:xxxx"
        }],
      "cpu":"0"
    }],
    "cpu":"256",
    "createdAt":"2021-05-08T11:57:33.528Z",
    "desiredStatus":"STOPPED",
    "enableExecuteCommand":false,
    "ephemeralStorage":{"sizeInGiB":20},
    "executionStoppedAt":"2021-05-08T11:58:02.463Z",
    "group":"family:test-task-definition",
    "launchType":"FARGATE",
    "lastStatus":"DEPROVISIONING",
    "memory":"512",
    "overrides":{
      "containerOverrides":[{"name":"Nginx-Container"}]
    },
    "platformVersion":"1.4.0",
    "pullStartedAt":"2021-05-08T11:57:47.724Z",
    "pullStoppedAt":"2021-05-08T11:57:59.522Z",
    "startedAt":"2021-05-08T11:58:02.494Z",
    "stoppingAt":"2021-05-08T11:58:12.541Z",
    "stoppedReason":"Essential container in task exited",
    "stopCode":"EssentialContainerExited",
    "taskArn":"arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/default/xxxxxxxxxxxxxxxxxxxxxxxx",
    "taskDefinitionArn":"arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task-definition/test-task-definition:8",
    "updatedAt":"2021-05-08T11:58:12.541Z",
    "version":4
  }
}

Task failed to start issues:

Such issues occur when a task fails to start, usually because of a problem in the task configuration that prevents its containers from launching. For example, the task may lack permissions to pull the container image or to fetch the Secrets Manager secrets used in container environment variables; its network configuration may be incorrect, so the image cannot be pulled; or there may be no host instance in the cluster with capacity to accommodate the task. There is a plethora of reasons that can lead to a task start failure.

The following event pattern can be used to capture tasks which failed to start:

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Task State Change"
  ],
  "detail": {
    "clusterArn": [
      "arn:aws:ecs:us-east-1:xxxxxxxxxxxx:cluster/clusterName"
    ],
    "stopCode": [
      "TaskFailedToStart"
    ]
  }
}
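
If you prefer to manage rules programmatically rather than through the console, the pattern above can be registered with boto3. Below is a minimal sketch; the rule name and the placeholder cluster ARN are my own choices and not part of any prescribed setup.

import json

import boto3

# EventBridge is exposed in boto3 as the "events" client
events = boto3.client("events", region_name="us-east-1")

# The "task failed to start" pattern shown above
pattern = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "clusterArn": ["arn:aws:ecs:us-east-1:xxxxxxxxxxxx:cluster/clusterName"],
        "stopCode": ["TaskFailedToStart"],
    },
}

# Create (or update) the rule on the default event bus
response = events.put_rule(
    Name="ecs-task-failed-to-start",  # hypothetical rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
    Description="Capture ECS tasks that failed to start",
)
print(response["RuleArn"])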

Task failed due to an application issue

In this case, the task starts but eventually stops due to an issue in the essential container's application.

Note: An ECS task can run as many as 10 containers as per the ECS default quotas. However, a container stopping causes the task to stop only if that container is marked as Essential; the task can remain in a running state even when a non-essential container has stopped.

If a container exits due to a command failure, it exits with a non-zero exit code, and if the container is essential, it stops the task as well. Such an event can be captured using the following event pattern:

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Task State Change"
  ],
  "detail": {
    "clusterArn": [
      "arn:aws:ecs:us-east-1:xxxxxxxxxxxx:cluster/clusterName"
    ],
    "containers": {
      "exitCode": [{
        "anything-but": [0]
      }]
    },
    "stopCode": [
      "EssentialContainerExited"
    ]
  }
}

Note: The above pattern captures non-zero exit codes from essential containers in tasks belonging to the cluster named “clusterName”. When using the pattern, substitute your own cluster’s ARN.
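
Before attaching this pattern to a rule, it is handy to verify that it actually matches the kind of event you expect. The sketch below uses the EventBridge TestEventPattern API (test_event_pattern in boto3) against a trimmed-down version of the sample event shown earlier; the trimming is mine and keeps only the fields relevant to the match.

import json

import boto3

events = boto3.client("events", region_name="us-east-1")

# Non-zero exit code pattern from above (cluster filter omitted for brevity)
pattern = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "containers": {"exitCode": [{"anything-but": [0]}]},
        "stopCode": ["EssentialContainerExited"],
    },
}

# Trimmed-down task state change event, based on the earlier sample
event = {
    "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "detail-type": "ECS Task State Change",
    "source": "aws.ecs",
    "account": "xxxxxxxxxxxx",
    "time": "2021-05-08T11:58:12Z",
    "region": "us-east-1",
    "resources": ["arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/default/xxxxxxxxxxxxxxxxxxxxxxxx"],
    "detail": {
        "containers": [{"exitCode": 127, "lastStatus": "STOPPED", "name": "Nginx-Container"}],
        "stopCode": "EssentialContainerExited",
    },
}

# Prints {'Result': True, ...} if the pattern matches the event
print(events.test_event_pattern(EventPattern=json.dumps(pattern), Event=json.dumps(event)))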

Task failures within an ECS service

This section delves into cornering tasks that fail within an ECS service. It is worth noting that the event patterns discussed so far apply to tasks of an ECS service as well, since those tasks carry the same stopCode and exitCode fields. However, as ECS services are logical isolations of microservices, it is worth spending some time on tracking failing tasks by their service names.

For a service-specific event pattern, any of the patterns discussed above can be extended with an additional “group” key. The following event pattern can be used by specifying the clusterArn and the ECS service name as service:<your-service-name> (replace <your-service-name> with your ECS service name):

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Task State Change"
  ],
  "detail": {
    "clusterArn": [
      "arn:aws:ecs:us-east-1:xxxxxxxxxxxx:cluster/clusterName"
    ],
    "group": [
      "service:<your-service-name>"
    ],
    "stopCode": [
      "TaskFailedToStart"
    ]
  }
}
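
Since only the cluster ARN, the service name, and the stop code change between such patterns, it can be convenient to generate them instead of hand-writing each one. Here is a small, hypothetical helper along those lines:

import json

def service_failure_pattern(cluster_arn, service_name, stop_code="TaskFailedToStart"):
    """Build a task-failure event pattern scoped to a single ECS service.

    Hypothetical helper: returns the JSON string expected by put_rule's
    EventPattern parameter.
    """
    return json.dumps({
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {
            "clusterArn": [cluster_arn],
            "group": ["service:" + service_name],
            "stopCode": [stop_code],
        },
    })

# Example usage with placeholder values
print(service_failure_pattern(
    "arn:aws:ecs:us-east-1:xxxxxxxxxxxx:cluster/clusterName",
    "my-service",
))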

Targets

In essence, targets are the AWS services or endpoints to which the event payload is sent when an event matches the event pattern. Each event rule can have a maximum of 5 targets. You can log the events of failing tasks for later RCA, send them to an SNS topic to get an alert (say, to an email address subscribed to the topic), or trigger another service as a fail-safe mechanism.
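
As an illustration, the sketch below attaches an SNS topic as a target of the rule created earlier, so that every matched event produces a notification. The rule name and topic ARN are placeholders of my own, and the topic’s access policy must allow events.amazonaws.com to publish to it.

import boto3

events = boto3.client("events", region_name="us-east-1")

# Attach an SNS topic as a target of the rule created earlier
events.put_targets(
    Rule="ecs-task-failed-to-start",  # hypothetical rule name from the earlier sketch
    Targets=[
        {
            "Id": "failed-task-alerts",  # any ID unique within the rule
            "Arn": "arn:aws:sns:us-east-1:xxxxxxxxxxxx:ecs-task-failures",  # placeholder topic ARN
        }
    ],
)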

Conclusion

At this point it is easy to see that there are different ways to log and trace failing ECS tasks, and doing so makes troubleshooting easier and more systematic. A further exploration of events and event patterns is worth the effort. You can keep an event pattern loose so that it matches a wider range of events with fewer filters, and then look for patterns relevant to your use cases. For example, the following pattern captures all events from ECS:

{
  "source": [
    "aws.ecs"
  ]
}

The captured events can then be used to craft more specific patterns according to your needs.