Configuring Anaconda with Spark

You can configure Anaconda to work with Spark jobs in three ways: with the spark-submit command, with Jupyter Notebooks and Cloudera CDH, or with Jupyter Notebooks and Hortonworks HDP.

After you configure Anaconda with one of these three methods, you can create and initialize a SparkContext.

Configuring Anaconda with the spark-submit command

You can submit Spark jobs with spark-submit by setting the PYSPARK_PYTHON environment variable to the location of the Python executable in your Anaconda installation.

EXAMPLE:

PYSPARK_PYTHON=/opt/continuum/anaconda/bin/python spark-submit pyspark_script.py
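
The script name pyspark_script.py above is only a placeholder. As a minimal sketch of what such a script might contain (the application name is arbitrary):

from pyspark import SparkConf, SparkContext

# Create a SparkContext; spark-submit supplies the master and deployment settings.
conf = SparkConf().setAppName('anaconda-pyspark-submit')
sc = SparkContext(conf=conf)

# Simple smoke test: sum the integers 0..99 on the cluster.
print(sc.parallelize(range(100)).sum())

sc.stop()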

Configuring Anaconda with Jupyter Notebooks and Cloudera CDH

Configure Jupyter Notebooks to use Anaconda Scale with Cloudera CDH by adding the following Python code at the top of your notebook:

import os
import sys

# Point Spark at the Anaconda Python and at the CDH installations of Java and Spark.
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.7.0_67-cloudera/jre"
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# Make the PySpark and Py4J libraries bundled with CDH importable from this notebook.
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

The above configuration was tested with Cloudera CDH 5.11 and Spark 1.6. Depending on the version of Cloudera CDH that you have installed, you might need to customize these paths according to the location of Java, Spark and Anaconda on your cluster.

If you’ve installed a custom Anaconda parcel, the path for PYSPARK_PYTHON will be /opt/cloudera/parcels/PARCEL_NAME/bin/python, where PARCEL_NAME is the name of the custom parcel you created.
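
In that case, only the PYSPARK_PYTHON line in the notebook code above changes, for example:

# PARCEL_NAME is a placeholder; substitute the actual name of your custom Anaconda parcel.
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/PARCEL_NAME/bin/python"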

Configuring Anaconda with Jupyter Notebooks and Hortonworks HDP

Configure Jupyter Notebooks to use Anaconda Scale with Hortonworks HDP by adding the following Python code at the top of your notebook:

import os
import sys

# Point Spark at the Anaconda Python and at the HDP installation of Spark.
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# Make the PySpark and Py4J libraries bundled with HDP importable from this notebook.
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

The above configuration was tested with Hortonworks HDP 2.6, Apache Ambari 2.4 and Spark 1.6. Depending on the version of Hortonworks HDP that you have installed, you might need to customize these paths according to the location of Spark and Anaconda on your cluster.

If you’ve installed a custom Anaconda management pack, the path for PYSPARK_PYTHON will be /opt/continuum/PARCEL_NAME/bin/python, where PARCEL_NAME is the name of the custom management pack you created.
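
As above, only the PYSPARK_PYTHON line in the notebook code changes, for example:

# PARCEL_NAME is a placeholder; substitute the actual name of your custom management pack.
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/PARCEL_NAME/bin/python"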

Creating a SparkContext

Once you have configured the appropriate environment variables, you can initialize a SparkContext, in yarn-client mode in this example, using:

from pyspark import SparkConf
from pyspark import SparkContext

# yarn-client mode: the driver runs in this process and executors run on the YARN cluster.
conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('anaconda-pyspark')
sc = SparkContext(conf=conf)

For more information about configuring Spark settings, see the PySpark documentation.
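
For example, additional Spark properties can be set on the SparkConf object before the SparkContext is created. The property names below are standard Spark settings; the values are illustrative only and should be tuned for your cluster:

# Call these before SparkContext(conf=conf); values shown are examples only.
conf.set('spark.executor.instances', '4')   # number of executors requested from YARN
conf.set('spark.executor.memory', '2g')     # memory per executor
conf.set('spark.executor.cores', '2')       # CPU cores per executor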

Once you’ve initialized a SparkContext, you can start using Anaconda with Spark jobs. For examples of Spark jobs that use libraries from Anaconda, see Using Anaconda with Spark.
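
As a minimal, hypothetical illustration (not a substitute for those examples), the following job applies a NumPy function from the Anaconda installation on each worker, using the SparkContext sc created above:

import numpy as np

# The lambda runs on the executors, which use the Anaconda Python set via PYSPARK_PYTHON.
rdd = sc.parallelize(range(1000), 4)
print(rdd.map(lambda x: np.sqrt(x)).sum())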