Overview of Spark, YARN and HDFS

Spark is an analytics engine and framework capable of running queries up to 100 times faster than traditional MapReduce jobs written in Hadoop. In addition to the performance boost, developers can write Spark jobs in Scala, Python or Java. Spark can load data directly from disk, memory and other data storage technologies such as Amazon S3, the Hadoop Distributed File System (HDFS), HBase and Cassandra.
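For example, here is a minimal PySpark sketch of loading the same kind of data from local disk or from HDFS simply by changing the URI scheme. It assumes the pyspark package is available on the machine, and the file paths shown are illustrative placeholders:

    # Minimal PySpark sketch; the paths below are illustrative placeholders.
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("load-example"))

    # Load text data from local disk on the cluster nodes
    local_lines = sc.textFile("file:///tmp/example.txt")

    # Load the same kind of data from HDFS by switching the URI scheme
    # (an s3n:// or s3a:// URI works the same way for Amazon S3)
    hdfs_lines = sc.textFile("hdfs:///user/someuser/example.txt")

    print(local_lines.count(), hdfs_lines.count())
    sc.stop()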

You can install Spark using an enterprise Hadoop distribution such as Cloudera CDH or Hortonworks HDP.

Submitting Spark Jobs

Spark scripts are often developed interactively and can be written as a script file or as a Jupyter notebook file.

A Spark script can be submitted to a Spark cluster using various methods:

  • Running the script directly on the head node
  • Using the spark-submit script, either in standalone mode or with the YARN resource manager
  • Running it interactively in an IPython shell or Jupyter Notebook on the cluster

To run a script directly on the head node, simply execute python example.py from a shell on that node.
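As an illustration, example.py could be a small, self-contained PySpark job such as the sketch below. The word-count logic, application name and input path are assumptions made for the example; only the file name comes from the command above:

    # example.py -- a minimal PySpark job (illustrative).
    # Run it directly on the head node:       python example.py
    # or submit it through the resource manager, for example:
    #     spark-submit --master yarn example.py
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("word-count-example"))

    # Illustrative HDFS path; point this at data that exists on your cluster
    lines = sc.textFile("hdfs:///user/someuser/input.txt")

    # Classic word count: split lines into words, then count occurrences
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Print the ten most frequent words
    for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, count)

    sc.stop()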

Note that in order to launch Jupyter Notebook on the cluster, the plugin must already be installed. See the Plugins documentation for more information.

Working with Data in HDFS

Moving data in and around HDFS can be difficult. If you need to move data from your local machine to HDFS, from Amazon S3 to HDFS, from Amazon S3 to Redshift, from HDFS to Hive and so on, we recommend using odo, which is part of the Blaze ecosystem. Odo efficiently migrates data from the source to the target through a network of conversions.
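The odo call itself follows a single source-to-target pattern. The sketch below assumes odo is installed; the file names and the S3 URI are illustrative placeholders, and the exact URI formats for HDFS, Hive, Redshift and other backends are described in the odo documentation:

    # Sketch of odo's source-to-target migration pattern; names/URIs are placeholders.
    import pandas as pd
    from odo import odo

    # Local CSV file -> in-memory pandas DataFrame
    df = odo('accounts.csv', pd.DataFrame)

    # The same call shape addresses remote targets by URI, for example an
    # Amazon S3 key (bucket and key are placeholders):
    # odo('accounts.csv', 's3://my-bucket/accounts.csv')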

If you are unfamiliar with Spark or SQL, we recommend using Blaze to express selections, aggregations, group-bys and so on in a dataframe-like style. Blaze provides Python users with a familiar interface to query data that exists in different data storage systems.
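For instance, a minimal Blaze sketch might express a selection and a group-by against a CSV file. The file name and its columns (name, amount) are assumptions made for the example:

    # Minimal Blaze sketch; 'accounts.csv' and its columns are illustrative.
    from blaze import data, by

    accounts = data('accounts.csv')

    # Selection: rows with a positive amount
    positive = accounts[accounts.amount > 0]

    # Group-by: total amount per name
    totals = by(accounts.name, total=accounts.amount.sum())

    print(totals)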