Running PySpark with the YARN resource manager
This example runs a script on the Spark cluster with the YARN resource manager and returns the hostname of each node in the cluster.
Who is this for?
This example is for users of a Spark cluster who wish to run a PySpark job using the YARN resource manager.
Before you start

Copy the spark-yarn.py example script to your cluster.
Running the job
This code is almost the same as the code on the page Running PySpark as a Spark standalone job, which describes the code in more detail.
Here is the complete script to run the Spark + YARN example in PySpark:
```python
# spark-yarn.py
from pyspark import SparkConf
from pyspark import SparkContext

# Configure Spark to use the YARN resource manager in client mode
conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))

# Distribute the range across the cluster, apply mod to each element,
# and collect the first 10 results back to the driver
rdd = sc.parallelize(range(1000)).map(mod).take(10)
print(rdd)
```
NOTE: You may need to install NumPy on the cluster nodes with `adam scale -n cluster conda install numpy`.
Run the script on the Spark cluster with spark-submit. The output shows the first 10 values that were returned from the `mod` function:
```
16/05/05 22:26:53 INFO spark.SparkContext: Running Spark version 1.6.0
[...]
16/05/05 22:27:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 3242 bytes)
16/05/05 22:27:04 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:46587 (size: 2.6 KB, free: 530.3 MB)
16/05/05 22:27:04 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 652 ms on localhost (1/1)
16/05/05 22:27:04 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/05/05 22:27:04 INFO scheduler.DAGScheduler: ResultStage 0 (runJob at PythonRDD.scala:393) finished in 4.558 s
16/05/05 22:27:04 INFO scheduler.DAGScheduler: Job 0 finished: runJob at PythonRDD.scala:393, took 4.951328 s
[(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1), (8, 0), (9, 1)]
```
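As a sanity check, the same `mod` mapping can be reproduced locally in plain Python without Spark or a cluster; this sketch simply emulates the `parallelize(...).map(mod).take(10)` pipeline on a single machine, and its result matches the final list in the output above:

```python
import numpy as np

def mod(x):
    # same per-element function the Spark job maps over the RDD;
    # int() converts the NumPy scalar so the pairs print as plain ints
    return (x, int(np.mod(x, 2)))

# emulate sc.parallelize(range(1000)).map(mod).take(10)
pairs = [mod(x) for x in range(1000)][:10]
print(pairs)
# → [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1), (8, 0), (9, 1)]
```

This is useful for verifying the logic of a map function before submitting it to the cluster.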