Overview of Spark, YARN and HDFS¶
Spark is an analytics engine and framework capable of running queries up to 100 times faster than traditional MapReduce jobs written in Hadoop. In addition to the performance boost, developers can write Spark jobs in Scala, Python, and Java. Spark can load data directly from disk or memory, as well as from data storage technologies such as Amazon S3, Hadoop Distributed File System (HDFS), HBase, and Cassandra.
You can install Spark using an enterprise Hadoop distribution such as Cloudera CDH or Hortonworks HDP.
Submitting Spark Jobs¶
Spark scripts are often developed interactively and can be written as a script file or as a Jupyter notebook file.
A Spark script can be submitted to a Spark cluster using various methods:
- Running the script directly on the head node.
- Using the spark-submit script, either in standalone mode or with the YARN resource manager.
- Interactively in an IPython shell or Jupyter Notebook on the cluster.
To run a script directly on the head node, simply execute python example.py on the cluster.
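As a rough sketch, a minimal example.py might look like the following; the application name and the HDFS path are placeholders and should be adapted to your cluster.

```python
# example.py -- a minimal PySpark script (the HDFS path below is a placeholder)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("example")
sc = SparkContext(conf=conf)

# Count the lines of a text file stored in HDFS
lines = sc.textFile("hdfs:///tmp/sample.txt")
print(lines.count())

sc.stop()
```

The same file can also be handed to spark-submit (for example, spark-submit --master yarn example.py) to run it through the YARN resource manager instead of directly on the head node.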
Note: In order to launch Jupyter Notebook on the cluster, the plugin must already be installed. See the Plugins documentation for more information.
Working with Data in HDFS¶
Moving data in and around HDFS can be difficult. If you need to move data
from your local machine to HDFS, from Amazon S3 to HDFS, from Amazon S3 to
Redshift, from HDFS to Hive and so on, we recommend using
odo, which is part of the
Blaze ecosystem. Odo
efficiently migrates
data from the source to the target through a network of conversions.
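For example, a minimal sketch of moving a local CSV file into HDFS with odo might look like the following; the file name, HDFS hostname, and target path are hypothetical, and the exact URI format and required backend libraries may differ for your installation, so consult the odo documentation.

```python
# A minimal sketch of migrating a local CSV file into HDFS with odo.
# The local file, HDFS hostname, and target path are placeholders.
from odo import odo

odo('iris.csv', 'hdfs://namenode.example.com:/tmp/iris.csv')
```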
If you are unfamiliar with Spark and/or SQL, we recommend using Blaze to express selections, aggregations, groupbys, etc. in a dataframe-like style. Blaze provides Python users with a familiar interface to query data that exists in different data storage systems.
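As an illustration, a Blaze query in this dataframe-like style might look like the following sketch; the CSV file and its column names are hypothetical, and the same expressions can be pointed at other backends by changing the URI passed to data().

```python
# A small sketch of querying data with Blaze; the data source and columns are hypothetical.
from blaze import data, by

# Blaze can point at many backends (CSV files, SQL databases, HDFS, etc.) via a URI or path.
iris = data('iris.csv')

# A selection and a group-by aggregation expressed in a dataframe-like style.
long_petals = iris[iris.petal_length > 5]
counts = by(iris.species, count=iris.species.count())
print(counts)
```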