Enable checkpointing - Amazon SageMaker AI

Enable checkpointing

After you enable checkpointing, SageMaker AI saves checkpoints to Amazon S3 and syncs your training job with the checkpoint S3 bucket. You can use either S3 general purpose or S3 directory buckets for your checkpoint S3 bucket.

Architecture diagram of writing checkpoints during training.

The following examples show how to configure checkpoint paths when you construct a SageMaker AI training object, using either version 3 or version 2 of the SageMaker Python SDK.

SageMaker Python SDK v3

To enable checkpointing, add the checkpoint_config parameter to your ModelTrainer. The following example template shows how to create a SageMaker AI ModelTrainer and enable checkpointing. You can use this template for any supported algorithm by specifying the training_image parameter. To find Docker image URIs for algorithms with checkpointing supported by SageMaker AI, see Docker Registry Paths and Example Code. In v3 of the SDK, the unified ModelTrainer class replaces all framework-specific estimator classes (such as TensorFlow, PyTorch, HuggingFace, and XGBoost).

    from sagemaker.train import ModelTrainer
    from sagemaker.train.configs import Compute, CheckpointConfig
    from sagemaker.core.helper.session_helper import Session

    bucket = Session().default_bucket()
    base_job_name = "sagemaker-checkpoint-test"
    checkpoint_in_bucket = "checkpoints"

    # The S3 URI to store the checkpoints
    checkpoint_s3_bucket = "s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

    # Assumption: role holds the ARN of your IAM execution role; replace the placeholder
    role = "<your-iam-role-arn>"

    model_trainer = ModelTrainer(
        training_image="<ecr_path>/<algorithm-name>:<tag>",
        role=role,
        compute=Compute(instance_type="ml.m5.xlarge", instance_count=1),
        base_job_name=base_job_name,
        # Configuration required to enable checkpointing
        checkpoint_config=CheckpointConfig(
            s3_uri=checkpoint_s3_bucket,
            local_path="/opt/ml/checkpoints"
        )
    )

The checkpoint_config parameter accepts a CheckpointConfig object with the following fields:

  • local_path – The local path in the training container where the model periodically saves its checkpoints. The default path is '/opt/ml/checkpoints'. If you are using other frameworks or bringing your own training container, ensure that your training script saves its checkpoints to '/opt/ml/checkpoints'.

    Note

    We recommend specifying the local path as '/opt/ml/checkpoints' to be consistent with the default SageMaker AI checkpoint settings. If you prefer to specify your own local path, make sure that the checkpoint saving path in your training script matches the local_path in your CheckpointConfig.

  • s3_uri – The URI to an S3 bucket where the checkpoints are stored in real time. You can specify either an S3 general purpose or S3 directory bucket to store your checkpoints. For more information on S3 directory buckets, see Directory buckets in the Amazon Simple Storage Service User Guide.
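
As a rough illustration of the script side of this setup, the following sketch saves checkpoints into the local checkpoint directory and picks up the most recent one if any are present when the script starts. The file naming scheme (checkpoint-<step>.json), the JSON format, and the function names save_checkpoint and load_latest_checkpoint are assumptions made for this example and are not part of the SageMaker AI API. A real training script would typically use its framework's own checkpoint format.

```python
import json
import os
import re
import tempfile

# Default local checkpoint directory named in this section.
DEFAULT_CHECKPOINT_DIR = "/opt/ml/checkpoints"

# Hypothetical naming scheme for this sketch: checkpoint-<step>.json
_CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)\.json$")


def save_checkpoint(state, step, checkpoint_dir=DEFAULT_CHECKPOINT_DIR):
    """Write `state` (a JSON-serializable dict) as the checkpoint for `step`."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, f"checkpoint-{step}.json")
    # Write to a temporary file first, then rename it, so that a partially
    # written file never appears under the final checkpoint name.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(checkpoint_dir=DEFAULT_CHECKPOINT_DIR):
    """Return (step, state) for the highest-numbered checkpoint, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    steps = []
    for name in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_PATTERN.match(name)
        if match:
            steps.append(int(match.group(1)))
    if not steps:
        return None
    latest = max(steps)
    with open(os.path.join(checkpoint_dir, f"checkpoint-{latest}.json")) as f:
        return latest, json.load(f)


if __name__ == "__main__":
    # Demo in a temporary directory. Inside a training container you would
    # rely on the default directory or on the local_path you configured.
    with tempfile.TemporaryDirectory() as demo_dir:
        resumed = load_latest_checkpoint(demo_dir)
        start_step = 0 if resumed is None else resumed[0] + 1
        for step in range(start_step, 3):
            save_checkpoint({"step": step, "loss": 1.0 / (step + 1)}, step, demo_dir)
        print(load_latest_checkpoint(demo_dir))
```

If you configure a local path other than the default, pass the same directory to both functions so that the script and the checkpoint configuration stay in agreement.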

To find a complete list of SageMaker AI ModelTrainer parameters, see the ModelTrainer API in the Amazon SageMaker Python SDK documentation.

SageMaker Python SDK v2 (Legacy)

To enable checkpointing, add the checkpoint_s3_uri and checkpoint_local_path parameters to your estimator. The following example template shows how to create a generic SageMaker AI estimator and enable checkpointing. You can use this template for the supported algorithms by specifying the image_uri parameter. To find Docker image URIs for algorithms with checkpointing supported by SageMaker AI, see Docker Registry Paths and Example Code. You can also replace the estimator module and the Estimator class in the import statement with another SageMaker AI framework's module and estimator class, such as TensorFlow, PyTorch, MXNet, HuggingFace, or XGBoost.

    import sagemaker
    from sagemaker.estimator import Estimator

    bucket = sagemaker.Session().default_bucket()
    base_job_name = "sagemaker-checkpoint-test"
    checkpoint_in_bucket = "checkpoints"

    # The S3 URI to store the checkpoints
    checkpoint_s3_bucket = "s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

    # The local path where the model will save its checkpoints in the training container
    checkpoint_local_path = "/opt/ml/checkpoints"

    estimator = Estimator(
        ...
        image_uri="<ecr_path>/<algorithm-name>:<tag>",  # Specify to use built-in algorithms
        output_path="s3://{}".format(bucket),  # output_path expects an S3 URI, not a bare bucket name
        base_job_name=base_job_name,

        # Parameters required to enable checkpointing
        checkpoint_s3_uri=checkpoint_s3_bucket,
        checkpoint_local_path=checkpoint_local_path
    )

The following two parameters specify paths for checkpointing:

  • checkpoint_local_path – The local path in the training container where the model periodically saves its checkpoints. The default path is '/opt/ml/checkpoints'. If you are using other frameworks or bringing your own training container, ensure that your training script saves its checkpoints to '/opt/ml/checkpoints'.

    Note

    We recommend specifying the local path as '/opt/ml/checkpoints' to be consistent with the default SageMaker AI checkpoint settings. If you prefer to specify your own local path, make sure that the checkpoint saving path in your training script matches the checkpoint_local_path parameter of the SageMaker AI estimator.

  • checkpoint_s3_uri – The URI to an S3 bucket where the checkpoints are stored in real time. You can specify either an S3 general purpose or S3 directory bucket to store your checkpoints. For more information on S3 directory buckets, see Directory buckets in the Amazon Simple Storage Service User Guide.
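
One way to keep the two paths aligned is to have the training script take the checkpoint directory as an argument instead of hard-coding it. The following is a minimal sketch of that idea. The --checkpoint-dir argument name and the parse_args helper are hypothetical, and the sketch assumes that whatever launches your script forwards the value as a command-line argument.

```python
import argparse

# Default local checkpoint directory named in this section.
DEFAULT_CHECKPOINT_DIR = "/opt/ml/checkpoints"


def parse_args(argv=None):
    """Parse the training script's arguments (a sketch with one argument)."""
    parser = argparse.ArgumentParser(description="Training script sketch")
    # Hypothetical argument name. Its value should equal the
    # checkpoint_local_path that you pass to the estimator.
    parser.add_argument(
        "--checkpoint-dir",
        default=DEFAULT_CHECKPOINT_DIR,
        help="Directory in the container where checkpoints are written",
    )
    # parse_known_args tolerates extra arguments that this sketch does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    print("Saving checkpoints to:", args.checkpoint_dir)
```

With this approach, the script falls back to '/opt/ml/checkpoints' when no argument is given, which matches the default path described above.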

To find a complete list of SageMaker AI estimator parameters, see the Estimator API in the Amazon SageMaker Python SDK documentation.