Manage compute for AI/ML workloads on Amazon EKS with node groups
This section covers how to manage accelerated compute (AWS Trainium, NVIDIA GPUs) for AI training and inference workloads using Amazon EKS managed node groups or self-managed nodes.
EKS managed node groups and self-managed nodes use EC2 Auto Scaling Groups (ASG). EKS managed node groups have dedicated EKS APIs for creating, updating, and deleting nodes, and also have node repair functionality and lifecycle termination hooks built-in. EKS self-managed nodes are deployed and managed directly through EC2 APIs.
With these options, you define the instance type, desired count, scaling boundaries, and EC2 launch template upfront. Consider using EKS managed node groups or self-managed nodes if you also have non-EKS workloads and prefer configuration consistency through EC2 launch templates. EKS node groups are a fit for training and fine-tuning workloads where the accelerated compute footprint is known in advance. Note that both EKS Auto Mode and Karpenter also support static capacity provisioning. For more information, see Manage compute for AI/ML workloads with EKS Auto Mode and Karpenter.
EKS managed node groups and self-managed nodes support all accelerated compute purchase options (On-Demand, Spot, On-Demand Capacity Reservations, Capacity Blocks for ML). You create a separate managed or self-managed node group per capacity type, each with its own launch template, instance types, and scaling configuration. This gives you explicit, ASG-backed control over each capacity pool without heterogeneous dynamic provisioning logic.
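The one-node-group-per-capacity-type layout can be sketched as plain data. The following Python is only an illustration: the `NodeGroupSpec` class, its field names, the group names, and the launch template names are hypothetical and are not part of any EKS API. It simply models that each capacity pool carries its own launch template, instance types, and scaling configuration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroupSpec:
    """Hypothetical description of one node group (not an EKS API type)."""
    name: str
    purchase_option: str      # e.g. "on-demand", "spot", "capacity-block"
    launch_template: str      # name of the EC2 launch template (assumed)
    instance_types: tuple
    min_size: int
    desired_size: int
    max_size: int

    def __post_init__(self):
        # Scaling boundaries must be ordered: min <= desired <= max.
        if not (0 <= self.min_size <= self.desired_size <= self.max_size):
            raise ValueError(f"invalid scaling boundaries for {self.name}")


def pools_by_purchase_option(specs):
    """Group node groups so each capacity pool can be reviewed on its own."""
    pools = defaultdict(list)
    for spec in specs:
        pools[spec.purchase_option].append(spec.name)
    return dict(pools)


specs = [
    NodeGroupSpec("train-cb", "capacity-block", "lt-train-cb", ("p5.48xlarge",), 0, 0, 4),
    NodeGroupSpec("infer-od", "on-demand", "lt-infer-od", ("p5.48xlarge",), 1, 2, 4),
    NodeGroupSpec("infer-spot", "spot", "lt-infer-spot", ("p5.48xlarge",), 0, 2, 8),
]
pools = pools_by_purchase_option(specs)
print(pools)
```

Keeping the purchase option as an explicit field on each group makes it easy to confirm that no single group mixes capacity types.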
EKS managed node groups vs. self-managed nodes
Choosing between EKS managed node groups and self-managed nodes depends on the level of customization and control you require. EKS managed node groups allow for a subset of EC2 launch template customization, whereas self-managed nodes support the full breadth of the EC2 launch template. If you don’t have a specific reason to customize and manage the node lifecycle yourself, start with EKS managed node groups and only move to self-managed nodes when a specific requirement forces it.
Use managed node groups when: You want EKS to handle AMI selection, node bootstrapping, rolling updates, node repair, and graceful drain workflows on your behalf. EKS managed node groups are the recommended starting point if you prefer not to use EKS Auto Mode or Karpenter for training and inference workloads. When using Capacity Blocks for ML, EKS managed node groups automatically create a scheduled scaling policy that drains the node group 40 minutes before the reservation ends, removing the need to use the AWS Node Termination Handler.
Use self-managed node groups when: You need full control over the EC2 launch template, AMI, kernel parameters, container runtime configuration, or custom bootstrap scripts. Common ML scenarios include tuning kernel and NIC settings for distributed training with Elastic Fabric Adapter (EFA), or integrating with a custom node lifecycle controller. Self-managed nodes give you the flexibility to ship any user data and IAM instance profile you need, but you take on responsibility for updates, scheduled scaling policies, and lifecycle hooks such as the AWS Node Termination Handler.
Reserve GPUs with Capacity Blocks for ML
Capacity Blocks for machine learning (ML) allow you to reserve GPU instances for a future date to run time-bound training or inference workloads. For more information, see Capacity Blocks for ML in the Amazon EC2 User Guide.
You can use Capacity Block reservations through EKS managed node groups and self-managed nodes. The EC2 launch template configuration is the same in both cases. The node creation workflow, scale-down behavior, and lifecycle hooks for workload termination differ across the provisioning options.
Considerations
- Capacity Blocks are only available for certain Amazon EC2 instance types and AWS Regions. See Work with Capacity Blocks Prerequisites for more information.
- Capacity Blocks are zonal. During node group creation, you must use the subnet in the same Availability Zone (AZ) as the Capacity Block reservation.
- If you create a node group before the Capacity Block reservation becomes active, set the desired capacity to 0 during node group creation.
- To allow time for graceful workload draining, schedule scale-to-zero more than 30 minutes before the Capacity Block reservation ends. EC2 begins shutting down instances 30 minutes before the reservation ends.
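The timing consideration above can be sketched in Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names, the group name `my-capacity-block-asg`, the action name, and the example end time are hypothetical. The 40-minute default mirrors the lead time that this guide describes for managed node groups. The dictionary keys follow the parameter names of the EC2 Auto Scaling `PutScheduledUpdateGroupAction` operation, on the assumption that you would pass them to that API for a self-managed node group; the sketch only builds the parameters and does not call AWS.

```python
from datetime import datetime, timedelta, timezone

# EC2 begins shutting down instances 30 minutes before the reservation ends.
EC2_SHUTDOWN_LEAD = timedelta(minutes=30)


def scale_to_zero_time(reservation_end, drain_lead=timedelta(minutes=40)):
    """Return when to scale to zero; the lead must exceed 30 minutes."""
    if drain_lead <= EC2_SHUTDOWN_LEAD:
        raise ValueError("drain lead must be more than 30 minutes")
    return reservation_end - drain_lead


def scheduled_action_params(asg_name, reservation_end,
                            drain_lead=timedelta(minutes=40)):
    """Build parameters for a scheduled scale-to-zero action (no API call)."""
    return {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": "capacity-block-scale-to-zero",  # hypothetical
        "StartTime": scale_to_zero_time(reservation_end, drain_lead),
        "MinSize": 0,
        "DesiredCapacity": 0,
    }


# Hypothetical reservation end time, in UTC.
end = datetime(2030, 1, 15, 11, 30, tzinfo=timezone.utc)
params = scheduled_action_params("my-capacity-block-asg", end)
print(params["StartTime"].isoformat())
```

Rejecting a lead time of 30 minutes or less up front avoids scheduling a drain that would start only after EC2 has already begun shutting down instances.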
Create node groups with Capacity Blocks for ML
EKS managed node groups and self-managed nodes require using a custom EC2 launch template that targets the Capacity Block reservation. The following shows the minimal required fields for EKS managed node groups and self-managed nodes. Additional fields are required for self-managed nodes as shown in the Self-managed nodes steps below.
The LaunchTemplateData must include:
- InstanceMarketOptions with MarketType set to "capacity-block"
- CapacityReservationSpecification: CapacityReservationTarget with CapacityReservationId set to the Capacity Block ID. For example, cr-0123456789abcdef0.
- InstanceType set to the instance type of your Capacity Block reservation. For example, p5.48xlarge.
These requirements are shown in the examples below for creating the launch template for EKS managed node groups and self-managed nodes.
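As a quick illustration of the three required fields, the following Python sketch assembles a minimal LaunchTemplateData structure and prints it as JSON. The function name is hypothetical, the `cr-` prefix check is an assumption based on the example ID above, and the reservation ID and instance type are the placeholder values from the list. It omits the additional fields that self-managed nodes require.

```python
import json


def capacity_block_launch_template_data(reservation_id, instance_type):
    """Build the minimal LaunchTemplateData for a Capacity Block reservation."""
    # Assumption: reservation IDs look like the example "cr-0123456789abcdef0".
    if not reservation_id.startswith("cr-"):
        raise ValueError("expected a Capacity Block ID such as cr-0123456789abcdef0")
    return {
        "InstanceType": instance_type,
        "InstanceMarketOptions": {"MarketType": "capacity-block"},
        "CapacityReservationSpecification": {
            "CapacityReservationTarget": {
                "CapacityReservationId": reservation_id,
            }
        },
    }


data = capacity_block_launch_template_data("cr-0123456789abcdef0", "p5.48xlarge")
print(json.dumps(data, indent=2))
```

The printed JSON could serve as a starting point for the launch template data, with further fields added as your node type requires.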