=========================
Using Anaconda with Spark
=========================

`Apache Spark <http://spark.apache.org/>`_ is an analytics engine and parallel
computation framework with Scala, Python and R interfaces. Spark can load data
directly from disk, memory and other data storage technologies such as Amazon
S3, Hadoop Distributed File System (HDFS), HBase, Cassandra and others.

Anaconda Scale can be used with a cluster that already has a managed
Spark/Hadoop stack. It can be installed alongside existing enterprise Hadoop
distributions such as
`Cloudera CDH <https://www.cloudera.com/products/apache-hadoop/key-cdh-components.html>`_ or
`Hortonworks HDP <http://hortonworks.com/products/data-center/hdp/>`_, and can
be used to manage Python and R conda packages and environments across a cluster.

To run a script on the head node, execute ``python example.py`` on the
cluster. Alternatively, you can use Anaconda Scale to install Jupyter Notebook
on the cluster. See the :doc:`install` documentation for more information.

.. _`submit-spark-job`:

Different ways to use Spark with Anaconda
=========================================

You can develop Spark scripts interactively, either as standalone Python scripts or in a Jupyter Notebook.

You can submit a PySpark script to a Spark cluster using various methods:

* Run the script directly on the head node by executing ``python example.py`` on the cluster.
* Use the `spark-submit <https://spark.apache.org/docs/latest/submitting-applications.html>`_
  command either in Standalone mode or with the YARN resource manager.
* Submit the script interactively in an IPython shell or Jupyter Notebook on the cluster. For information on using Anaconda Scale to install Jupyter Notebook on the cluster, see :doc:`install`.

You can also use Anaconda Scale with enterprise Hadoop distributions such as
Cloudera CDH or Hortonworks HDP. 

Using Anaconda Scale with Spark
===============================

The topics listed below describe how to:

* Use Anaconda and Anaconda Scale with Apache Spark and PySpark
* Interact with data stored within the Hadoop Distributed File System (HDFS) on the cluster

While these tasks are independent and can be performed in any order, we recommend that you begin with :doc:`Configuring Anaconda with Spark <howto/spark-configuration>`.

.. toctree::
   :maxdepth: 1

   howto/spark-configuration
   howto/spark-basic
   howto/spark-yarn
   howto/spark-wordcount
   howto/spark-nltk
