===============================
Configuring Anaconda with Spark
===============================

You can configure Anaconda to work with Spark jobs in three ways:
:ref:`with the "spark-submit" command <scale-spark-config-spark-submit>`,
:ref:`with Jupyter Notebooks and Cloudera CDH <scale-spark-config-cloudera>`, or
:ref:`with Jupyter Notebooks and Hortonworks HDP <scale-spark-config-hortonworks>`.

After you configure Anaconda with one of those three methods, you can
:ref:`create and initialize a SparkContext <scale-spark-config-sparkcontext>`.


.. _scale-spark-config-spark-submit:

Configuring Anaconda with the ``spark-submit`` command
======================================================

You can submit Spark jobs by setting the ``PYSPARK_PYTHON`` environment
variable to the location of the Python executable in your Anaconda
installation.

EXAMPLE:

.. code-block:: bash

    PYSPARK_PYTHON=/opt/continuum/anaconda/bin/python spark-submit pyspark_script.py
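
For example, ``pyspark_script.py`` could be a small job that verifies which
interpreter the executors are running. The following is a minimal sketch; the
application name and partition counts are illustrative:

.. code-block:: python

    # pyspark_script.py: a minimal job that reports the Python executable
    # each task runs under; with PYSPARK_PYTHON set, these paths should
    # point into the Anaconda installation.
    import sys
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName('verify-anaconda-python')
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(4), 2).map(lambda _: sys.executable).collect())
    sc.stop()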


.. _scale-spark-config-cloudera:

Configuring Anaconda with Jupyter Notebooks and Cloudera CDH
============================================================

Configure Jupyter Notebooks to use Anaconda Scale with Cloudera CDH by adding
the following Python code to the top of your notebook:

.. code-block:: python

    import os
    import sys

    # Point Spark at the Anaconda interpreter and at the CDH locations
    # of Java and Spark; set these before importing pyspark.
    os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
    os.environ["JAVA_HOME"] = "/usr/java/jdk1.7.0_67-cloudera/jre"
    os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"
    os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

    # Make the bundled pyspark and py4j packages importable.
    sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
    sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

The above configuration was tested with Cloudera CDH 5.11 and Spark 1.6.
Depending on the version of Cloudera CDH that you have installed, you might need
to customize these paths according to the location of Java, Spark, and Anaconda
on your cluster.
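
To check that these locations match your cluster layout, you can run a quick
sanity check in the same notebook, after the configuration cell above. This is
a sketch that simply reports whether each configured path exists on the driver
node:

.. code-block:: python

    # Optional sanity check (a sketch): report whether each configured
    # path exists, since the exact locations vary by CDH release.
    for var in ("PYSPARK_PYTHON", "JAVA_HOME", "SPARK_HOME", "PYLIB"):
        path = os.environ[var]
        print(var, path, "OK" if os.path.exists(path) else "MISSING")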

If you've installed a custom Anaconda parcel, the path for ``PYSPARK_PYTHON``
will be ``/opt/cloudera/parcels/PARCEL_NAME/bin/python``, where ``PARCEL_NAME``
is the name of the custom parcel you created.

.. _scale-spark-config-hortonworks:

Configuring Anaconda with Jupyter Notebooks and Hortonworks HDP
===============================================================

Configure Jupyter Notebooks to use Anaconda Scale with Hortonworks HDP by
adding the following Python code to the top of your notebook:

.. code-block:: python

    import os
    import sys

    # Point Spark at the Anaconda interpreter and at the HDP install of
    # Spark; set these before importing pyspark.
    os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
    os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
    os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

    # Make the bundled pyspark and py4j packages importable.
    sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
    sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

The above configuration was tested with Hortonworks HDP 2.6, Apache Ambari 2.4,
and Spark 1.6. Depending on the version of Hortonworks HDP that you have
installed, you might need to customize these paths according to the location of
Spark and Anaconda on your cluster.
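
Note that the ``py4j-0.9-src.zip`` filename above is tied to Spark 1.6; newer
Spark releases bundle a different py4j version under ``$SPARK_HOME/python/lib``.
A more version-tolerant variant (a sketch) globs for the zip instead of
hardcoding its name:

.. code-block:: python

    # A version-tolerant sketch: locate the py4j zip by pattern rather
    # than hardcoding "py4j-0.9-src.zip", since the bundled py4j version
    # changes across Spark releases.
    import glob
    py4j_zip = glob.glob(os.path.join(os.environ["PYLIB"], "py4j-*-src.zip"))[0]
    sys.path.insert(0, py4j_zip)
    sys.path.insert(0, os.path.join(os.environ["PYLIB"], "pyspark.zip"))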

If you've installed a custom Anaconda management pack, the path for
``PYSPARK_PYTHON`` will be ``/opt/continuum/PARCEL_NAME/bin/python``,
where ``PARCEL_NAME`` is the name of the custom management pack you created.

.. _scale-spark-config-sparkcontext:

Creating a SparkContext
=======================

Once you have configured the appropriate environment variables, you can
initialize a SparkContext (in ``yarn-client`` mode in this example):

.. code-block:: python

    from pyspark import SparkConf
    from pyspark import SparkContext

    # Run on YARN in client mode, with the driver in this notebook.
    conf = SparkConf()
    conf.setMaster('yarn-client')
    conf.setAppName('anaconda-pyspark')
    sc = SparkContext(conf=conf)
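
To confirm that the context can reach the cluster and that the executors see
Anaconda's libraries, you can run a small job. The following smoke-test sketch
assumes NumPy is available in the Anaconda environment on every node:

.. code-block:: python

    # Smoke test (a sketch): distribute a NumPy computation across the
    # executors and sum the results on the driver.
    import numpy as np
    print(sc.parallelize(range(1000)).map(lambda x: float(np.sqrt(x))).sum())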

For more information about configuring Spark settings, see the
`PySpark documentation <http://spark.apache.org/docs/latest/programming-guide.html>`_.

Once you've initialized a SparkContext, you can start using Anaconda with Spark
jobs. For examples of Spark jobs that use libraries from Anaconda, see :doc:`Using Anaconda with Spark <../spark>`.
