Run a Processing Job with Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing. Amazon SageMaker AI
provides prebuilt Docker images that include Apache Spark and other dependencies needed
to run distributed data processing jobs. The following example shows how to
run an Amazon SageMaker Processing job using Apache Spark.
With the Amazon SageMaker Python SDK, you can easily apply data transformations and extract
features (feature engineering) using the Spark framework. For information about using
the SageMaker Python SDK to run Spark processing jobs, see Data Processing with Spark in the Amazon SageMaker Python
SDK.
A code repository that contains the source code and Dockerfiles for the
Spark images is available on GitHub.
You can use the sagemaker.spark.PySparkProcessor or sagemaker.spark.SparkJarProcessor class to run your Spark
application inside a processing job. You can set MaxRuntimeInSeconds to a
maximum runtime limit of 5 days. For simple Spark workloads, the relationship
between the number of instances used and the time to completion is nearly
linear.
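As a small illustration of the runtime limit, the following sketch converts a duration to the whole-second value that MaxRuntimeInSeconds expects and rejects anything above the 5-day figure stated above. The helper name runtime_in_seconds and the constant MAX_RUNTIME_LIMIT are hypothetical and are not part of the SageMaker Python SDK; the limit is copied from this page rather than queried from the service.

```python
from datetime import timedelta

# Limit taken from the text above (5 days); this is an assumption copied
# from the documentation, not a value read from the SageMaker API.
MAX_RUNTIME_LIMIT = timedelta(days=5)


def runtime_in_seconds(hours: float = 0, minutes: float = 0) -> int:
    """Convert a duration to whole seconds, rejecting out-of-range values."""
    requested = timedelta(hours=hours, minutes=minutes)
    if requested <= timedelta(0):
        raise ValueError("runtime must be positive")
    if requested > MAX_RUNTIME_LIMIT:
        raise ValueError(f"runtime {requested} exceeds the limit of {MAX_RUNTIME_LIMIT}")
    return int(requested.total_seconds())


# 20 minutes corresponds to the max_runtime_in_seconds value of 1200
# that the processor examples on this page pass.
print(runtime_in_seconds(minutes=20))
```

The returned integer could be passed as max_runtime_in_seconds when constructing a processor.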
The following code example shows how to run a processing job that invokes
your PySpark script preprocess.py.
- SageMaker Python SDK v3

    from sagemaker.core.spark.processing import PySparkProcessor

    # role, bucket, input_prefix, and output_prefix are assumed to be
    # defined earlier in your session.
    spark_processor = PySparkProcessor(
        base_job_name="spark-preprocessor",
        framework_version="2.4",
        role=role,
        instance_count=2,
        instance_type="ml.m5.xlarge",
        max_runtime_in_seconds=1200,
    )

    # Submit preprocess.py; the arguments list is handed to the script.
    spark_processor.run(
        submit_app="preprocess.py",
        arguments=[
            "s3_input_bucket", bucket,
            "s3_input_key_prefix", input_prefix,
            "s3_output_bucket", bucket,
            "s3_output_key_prefix", output_prefix,
        ],
    )
- SageMaker Python SDK v2 (Legacy)

    from sagemaker.spark.processing import PySparkProcessor

    # role, bucket, input_prefix, and output_prefix are assumed to be
    # defined earlier in your session.
    spark_processor = PySparkProcessor(
        base_job_name="spark-preprocessor",
        framework_version="2.4",
        role=role,
        instance_count=2,
        instance_type="ml.m5.xlarge",
        max_runtime_in_seconds=1200,
    )

    # Submit preprocess.py; the arguments list is handed to the script.
    spark_processor.run(
        submit_app="preprocess.py",
        arguments=[
            "s3_input_bucket", bucket,
            "s3_input_key_prefix", input_prefix,
            "s3_output_bucket", bucket,
            "s3_output_key_prefix", output_prefix,
        ],
    )
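The examples above pass the script a flat list of alternating names and values. As a minimal sketch of how preprocess.py might read such a list, assuming it is forwarded to the script as command-line arguments in the same order, the following pairs the names with their values and builds S3 URIs from them. The helpers parse_pairs and s3_uri, the bucket name, and the prefixes are hypothetical and shown only for illustration.

```python
def parse_pairs(argv):
    """Turn ['name1', 'value1', 'name2', 'value2', ...] into a dict."""
    if len(argv) % 2 != 0:
        raise ValueError("expected alternating name/value arguments")
    # Even positions hold names, odd positions hold the matching values.
    return dict(zip(argv[0::2], argv[1::2]))


def s3_uri(bucket, prefix):
    """Join a bucket and key prefix into an s3:// URI."""
    return f"s3://{bucket}/{prefix.strip('/')}"


# Hypothetical argument list in the same shape as the `arguments`
# parameter in the examples above. A real script would read
# sys.argv[1:] instead of this hard-coded list.
example_argv = [
    "s3_input_bucket", "my-bucket",
    "s3_input_key_prefix", "raw/data",
    "s3_output_bucket", "my-bucket",
    "s3_output_key_prefix", "processed/data",
]

args = parse_pairs(example_argv)
input_uri = s3_uri(args["s3_input_bucket"], args["s3_input_key_prefix"])
output_uri = s3_uri(args["s3_output_bucket"], args["s3_output_key_prefix"])
print(input_uri)   # s3://my-bucket/raw/data
print(output_uri)  # s3://my-bucket/processed/data
```

If your script declares its options with argparse flags such as --s3_input_bucket instead, the names in the arguments list would need the same leading dashes.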
For an in-depth look, see the Distributed Data Processing with Apache Spark and
SageMaker Processing
example notebook.
If you are not using the Amazon SageMaker AI
Python SDK and one of its Processor classes to retrieve the prebuilt
images, you can retrieve these images yourself. The SageMaker prebuilt Docker images
are stored in Amazon Elastic Container Registry (Amazon ECR). For a complete list of
the available prebuilt Docker images, see the available images document.
To learn more about using the SageMaker Python SDK with Processing containers,
see Amazon SageMaker AI Python
SDK.