Running PySpark with the YARN resource manager

This example runs a script on the Spark cluster with the YARN resource manager and returns the hostname of each node in the cluster.

Who is this for?

This example is for users of a Spark cluster who wish to run a PySpark job using the YARN resource manager.

Before you start

Download the spark-yarn.py example script to your cluster.

You need Spark running with the YARN resource manager. You can install Spark and YARN using an enterprise Hadoop distribution such as Cloudera CDH or Hortonworks HDP.

Running the job

This code is almost the same as the code on the page Running PySpark as a Spark standalone job, which describes the code in more detail.

Here is the complete script to run the Spark + YARN example in PySpark:

# spark-yarn.py
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)


def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))

rdd = sc.parallelize(range(1000)).map(mod).take(10)
print rdd

NOTE: You may need to install NumPy on the cluster nodes using adam scale -n cluster conda install numpy.

Run the script on the Spark cluster with spark-submit. The output shows the first 10 values that were returned from the spark-basic.py script.

16/05/05 22:26:53 INFO spark.SparkContext: Running Spark version 1.6.0

[...]

16/05/05 22:27:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 3242 bytes)
16/05/05 22:27:04 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:46587 (size: 2.6 KB, free: 530.3 MB)
16/05/05 22:27:04 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 652 ms on localhost (1/1)
16/05/05 22:27:04 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/05/05 22:27:04 INFO scheduler.DAGScheduler: ResultStage 0 (runJob at PythonRDD.scala:393) finished in 4.558 s
16/05/05 22:27:04 INFO scheduler.DAGScheduler: Job 0 finished: runJob at PythonRDD.scala:393, took 4.951328 s
[(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1), (8, 0), (9, 1)]

Troubleshooting

If something goes wrong, see Help and support.

Further information

See the Spark and PySpark documentation pages for more information.