=================================================
Running PySpark with the YARN resource manager
=================================================

This example runs a PySpark job on a Spark cluster with the YARN resource
manager. The job pairs each number in a range with its parity and returns the
first ten results.


Who is this for?
================

This example is for users of a Spark cluster who wish to run a PySpark job using
the YARN resource manager.


.. _`scale-spark-yarn-before-you-start`:

Before you start
================

Download the :download:`spark-yarn.py example script <spark-yarn/spark-yarn.py>`
to your cluster.

You need Spark running with the YARN resource manager. You
can install Spark and YARN using an enterprise Hadoop distribution such as
`Cloudera CDH <https://www.cloudera.com/products/apache-hadoop/key-cdh-components.html>`_
or `Hortonworks HDP <http://hortonworks.com/products/data-center/hdp/>`_.


.. _`scale-spark-yarn-run-yarn-job`:

Running the job
===============

This code is nearly identical to the code on the page :doc:`spark-basic`,
which describes it in more detail.

Here is the complete script to run the Spark + YARN example in PySpark:

.. code-block:: python

    # spark-yarn.py
    from pyspark import SparkConf
    from pyspark import SparkContext

    conf = SparkConf()
    # 'yarn-client' mode runs the driver locally; executors run under YARN
    conf.setMaster('yarn-client')
    conf.setAppName('spark-yarn')
    sc = SparkContext(conf=conf)


    def mod(x):
        # Import NumPy inside the function so the import runs on the workers
        import numpy as np
        return (x, np.mod(x, 2))

    # take(10) collects the first 10 results back to the driver as a list
    results = sc.parallelize(range(1000)).map(mod).take(10)
    print(results)

.. note:: You may need to install NumPy on the cluster nodes, for example with
   ``adam scale -n cluster conda install numpy``.
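You can preview what the job computes without a cluster: for non-negative
integers, ``np.mod(x, 2)`` is equivalent to Python's ``x % 2``, so the same
pairs can be produced locally with a list comprehension:

.. code-block:: python

    # Local preview of the Spark job's result: each number paired with
    # its remainder modulo 2 (its parity).
    pairs = [(x, x % 2) for x in range(1000)]
    print(pairs[:10])
    # [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1), (8, 0), (9, 1)]
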

Run the script on the Spark cluster with
`spark-submit <https://spark.apache.org/docs/latest/submitting-applications.html>`_.
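Because the script sets the master to ``yarn-client`` in its ``SparkConf``, no
``--master`` flag is needed on the command line. A typical invocation might look
like this (assuming ``spark-submit`` is on your ``PATH`` and your environment
points at the cluster's YARN configuration):

.. code-block:: bash

    spark-submit spark-yarn.py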
The output shows the first 10 values that were returned from the ``spark-yarn.py`` script.

.. code-block:: text

    16/05/05 22:26:53 INFO spark.SparkContext: Running Spark version 1.6.0

    [...]

    16/05/05 22:27:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 3242 bytes)
    16/05/05 22:27:04 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:46587 (size: 2.6 KB, free: 530.3 MB)
    16/05/05 22:27:04 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 652 ms on localhost (1/1)
    16/05/05 22:27:04 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
    16/05/05 22:27:04 INFO scheduler.DAGScheduler: ResultStage 0 (runJob at PythonRDD.scala:393) finished in 4.558 s
    16/05/05 22:27:04 INFO scheduler.DAGScheduler: Job 0 finished: runJob at PythonRDD.scala:393, took 4.951328 s
    [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1), (8, 0), (9, 1)]


.. _`scale-spark-yarn-troubleshooting`:

Troubleshooting
===============

If something goes wrong, see :doc:`Help and support <../help-support>`.


.. _`scale-spark-yarn-further-info`:

Further information
===================

See the Spark_ and PySpark_ documentation pages for more information.

.. _Spark: https://spark.apache.org/
.. _PySpark: https://spark.apache.org/docs/latest/programming-guide.html
