How to do Image Processing with GPUs¶
Overview¶
To demonstrate running a distributed PySpark job on GPUs, this example uses NumbaPro and the CUDA platform for image analysis. It applies a 2-dimensional FFT convolution to grayscale images and compares the execution times of the CPU-based and GPU-based calculations.
Who is this for?¶
This how-to is for users of a Spark cluster who wish to run Python code using the YARN resource manager. It shows you how to integrate third-party Python libraries with Spark.
Image processing summary¶
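At a high level, each image is loaded, converted to a grayscale array, and convolved with a small kernel by way of a 2-dimensional FFT; the job times this operation on the CPU and on the GPU (via NumbaPro and CUDA). A minimal sketch of the CPU-side calculation using SciPy's fftconvolve is shown below. It assumes PIL and SciPy (installed later in this how-to); the file name and the 3x3 averaging kernel are purely illustrative and are not necessarily what spark-numbapro.py uses.
import numpy as np
from PIL import Image
from scipy.signal import fftconvolve

def load_grayscale(path):
    # Read an image file and convert it to a 2-D array of floats (grayscale).
    return np.asarray(Image.open(path).convert('L'), dtype=np.float64)

def cpu_convolve(image, kernel):
    # 2-D FFT convolution on the CPU; mode='same' keeps the output the same size as the input.
    return fftconvolve(image, kernel, mode='same')

# Illustrative 3x3 averaging kernel and file name (both hypothetical).
kernel = np.ones((3, 3)) / 9.0
result = cpu_convolve(load_grayscale('dog.1.jpg'), kernel)
The GPU-based version in spark-numbapro.py performs the equivalent FFT convolution with CUDA, which is where the difference in the timings shown later comes from.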
Before you start¶
For this example, you’ll need Spark running with the YARN resource manager. You can install Spark and YARN using an enterprise Hadoop distribution such as Cloudera CDH or Hortonworks HDP.
You will also need valid Amazon Web Services (AWS) credentials in order to download the example data.
For this example, we recommend the GPU-enabled AWS instance type g2.2xlarge and the AMI ami-12fd8178 (us-east-1 region), which has CUDA 7.0 and the NVIDIA drivers pre-installed. An example profile (to be placed in ~/.acluster/profiles.d/gpu_profile.yaml) is shown below:
name: gpu_profile
node_id: ami-12fd8178 # Ubuntu 14.04, Cuda 7.0, us-east-1 region
node_type: g2.2xlarge
num_nodes: 4
provider: aws_east
user: ubuntu
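If a cluster has not already been launched from this profile, it can be created with acluster; a command along the following lines should work (the cluster name gpu_cluster is arbitrary):
acluster create gpu_cluster --profile gpu_profile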
To execute this example, download the spark-numbapro.py example script or the spark-numbapro.ipynb example notebook.
If you wish to use the spark-numbapro.ipynb example notebook, the Jupyter Notebook plugin can be installed on the cluster using the following command:
acluster install notebook
Once the Jupyter Notebook plugin is installed, you can view Jupyter Notebook in your browser using the following command:
acluster open notebook
Install dependencies¶
If you have permission to install packages with acluster, you can install the required packages on all nodes using the following command:
acluster conda install scipy matplotlib numbapro PIL
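Optionally, you can verify that NumbaPro can see the CUDA driver and GPU devices by running its built-in check from a Python session on a node (this assumes your NumbaPro version provides the check_cuda helper):
import numbapro
numbapro.check_cuda()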
Load data into HDFS¶
First, we will load the sample image data into the HDFS data store. The following script transfers sample image data (approximately 1.1 GB) from a public Amazon S3 bucket to the HDFS data store on the cluster.
Download the cluster-download-data.py script to your cluster and insert your Amazon AWS credentials in the AWS_KEY and AWS_SECRET variables.
import subprocess

# Insert your AWS credentials here.
AWS_KEY = 'YOUR_AWS_KEY'
AWS_SECRET = 'YOUR_AWS_SECRET'

# Copy the sample images (~1.1 GB) from the public S3 bucket into HDFS using distcp.
s3_path = 's3n://{0}:{1}@blaze-data/dogs-cats-img/images'.format(AWS_KEY, AWS_SECRET)
cmd = ['hadoop', 'distcp', s3_path, 'hdfs:///tmp/dogs-cats']
subprocess.call(cmd)
Note: The hadoop distcp command might cause HDFS to fail on smaller instance sizes due to memory limits.
Run the cluster-download-data.py script on the cluster:
python cluster-download-data.py
After a few minutes, the image data will be in the HDFS data store on the cluster and ready for analysis.
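You can optionally confirm the transfer by listing the destination directory with a standard HDFS command, for example:
hadoop fs -ls /tmp/dogs-cats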
Running the Job¶
Run the spark-numbapro.py script on the Spark cluster using spark-submit.
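An invocation along these lines should work from a cluster node, assuming the script is present there (yarn-client is the client-mode YARN master URL in Spark 1.5; adjust paths and options for your environment):
spark-submit --master yarn-client spark-numbapro.py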
The output shows the image processing execution times (in seconds) for the CPU-based and GPU-based calculations:
54.164.123.31: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/09 02:33:21 INFO SparkContext: Running Spark version 1.5.1
[...]
15/11/09 02:33:45 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 13)
in 106 ms on ip-172-31-9-24.ec2.internal (7/7)
15/11/09 02:33:45 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have
all completed, from pool
15/11/09 02:33:45 INFO DAGScheduler: ResultStage 1
(collect at /tmp/anaconda-cluster/spark-numbapro.py:106) finished in 4.844 s
15/11/09 02:33:45 INFO DAGScheduler: Job 1 finished:
collect at /tmp/anaconda-cluster/spark-numbapro.py:106, took 4.854970 s
10 images
CPU: 6.91735601425
GPU: 4.88133311272
[...]
15/11/09 02:34:27 INFO TaskSetManager: Finished task 255.0 in stage 3.0 (TID 525)
in 139 ms on ip-172-31-9-24.ec2.internal (256/256)
15/11/09 02:34:27 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have
all completed, from pool
15/11/09 02:34:27 INFO DAGScheduler: ResultStage 3
(collect at /tmp/anaconda-cluster/spark-numbapro.py:126) finished in 19.340 s
15/11/09 02:34:27 INFO DAGScheduler: Job 3 finished:
collect at /tmp/anaconda-cluster/spark-numbapro.py:126, took 19.400670 s
500 images
CPU: 22.1282501221
GPU: 19.8209779263
Troubleshooting¶
If something goes wrong, consult the FAQ / Known issues page.
Further information¶
See the Spark and PySpark documentation pages for more information.
For more information on NumbaPro, see the NumbaPro documentation.