======================================================
Distributed natural language processing
======================================================

This example provides a simple PySpark job that uses the
`NLTK library <http://www.nltk.org/>`_, a popular Python package for natural
language processing. It shows how to integrate a third-party Python library
with Spark: installing the library on the cluster nodes, using Spark with the
YARN resource manager, and running the PySpark job.


Who is this for?
================

This example is for users of a Spark cluster who wish to run a PySpark job
with the YARN resource manager.


.. _`scale-spark-nltk-before-you-start`:

Before you start
================

Download the :download:`spark-nltk.py example script <spark-nltk/spark-nltk.py>` or
:download:`spark-nltk.ipynb example notebook <spark-nltk/spark-nltk.ipynb>`.

You need Spark running with the YARN resource manager. You
can install Spark and YARN using an enterprise Hadoop distribution such as
`Cloudera CDH <https://www.cloudera.com/products/apache-hadoop/key-cdh-components.html>`_
or `Hortonworks HDP <http://hortonworks.com/products/data-center/hdp>`_.


.. _`scale-spark-nltk-install-nltk`:

Install NLTK
============

Install NLTK on all of the cluster nodes using the ``adam scale`` command:

.. code-block:: bash

    $ adam scale -n cluster conda install nltk

You should see output similar to this from each node, which indicates that the
package was successfully installed across the cluster:

.. code-block:: text

    All nodes (x4) response:
    {
      "actions": {
        "EXTRACT": [
          "conda-env-2.5.2-py27_0",
          "conda-4.1.11-py27_0"
        ],
        "FETCH": [
          "conda-env-2.5.2-py27_0",
          "conda-4.1.11-py27_0"
        ],
        "LINK": [
          "conda-env-2.5.2-py27_0 1 None",
          "conda-4.1.11-py27_0 1 None"
        ],
        "PREFIX": "/opt/continuum/anaconda",
        "SYMLINK_CONDA": [
          "/opt/continuum/anaconda"
        ],
        "UNLINK": [
          "conda-4.1.6-py27_0",
          "conda-env-2.5.1-py27_0"
        ],
        "op_order": [
          "RM_FETCHED",
          "FETCH",
          "RM_EXTRACTED",
          "EXTRACT",
          "UNLINK",
          "LINK",
          "SYMLINK_CONDA"
        ]
      },
      "success": true
    }
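
To confirm that NLTK can be imported on every node, you can run a quick check
with the ``adam cmd`` command (this assumes the Anaconda install prefix shown
in the output above):

.. code-block:: bash

    $ adam cmd '/opt/continuum/anaconda/bin/python -c "import nltk; print(nltk.__version__)"'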

For this example, you need to download the NLTK sample data. Download the data
on all cluster nodes by using the ``adam cmd`` command:

.. code-block:: text

    $ adam cmd 'sudo /opt/continuum/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data all'

Downloading the sample data takes a few minutes. After the download
completes, you should see output similar to this:

.. code-block:: text

    All nodes (x4) response: [nltk_data] Downloading collection 'all'
    [nltk_data]    |
    [nltk_data]    | Downloading package abc to /usr/share/nltk_data...
    [nltk_data]    |   Unzipping corpora/abc.zip.
    [nltk_data]    | Downloading package alpino to /usr/share/nltk_data...
    [nltk_data]    |   Unzipping corpora/alpino.zip.
    [nltk_data]    | Downloading package biocreative_ppi to
    [nltk_data]    |     /usr/share/nltk_data...

    ....

    [nltk_data]    |   Unzipping models/bllip_wsj_no_aux.zip.
    [nltk_data]    | Downloading package word2vec_sample to
    [nltk_data]    |     /usr/share/nltk_data...
    [nltk_data]    |   Unzipping models/word2vec_sample.zip.
    [nltk_data]    |
    [nltk_data]  Done downloading collection all
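
Downloading the full collection takes a while. If you only need the data used
in this example, a shortcut (the package names below assume a recent NLTK
release) is to download just the State of the Union corpus plus the tokenizer
and tagger models:

.. code-block:: text

    $ adam cmd 'sudo /opt/continuum/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data state_union punkt averaged_perceptron_tagger'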


.. _`scale-spark-nltk-run-nltk-job`:

Running the job
===============

Here is the complete script to run the Spark + NLTK example in PySpark:

.. code-block:: python

    # spark-nltk.py
    from pyspark import SparkConf
    from pyspark import SparkContext

    # Configure Spark to run against the YARN resource manager in client mode.
    conf = SparkConf()
    conf.setMaster('yarn-client')
    conf.setAppName('spark-nltk')
    sc = SparkContext(conf=conf)

    # Load one of the sample documents included in the NLTK data.
    data = sc.textFile('file:///usr/share/nltk_data/corpora/state_union/1972-Nixon.txt')

    def word_tokenize(x):
        # Import nltk inside the function so the import happens on the workers.
        import nltk
        return nltk.word_tokenize(x)

    def pos_tag(x):
        import nltk
        return nltk.pos_tag([x])

    # Split every line of the document into individual words.
    words = data.flatMap(word_tokenize)
    print(words.take(10))

    # Tag each word with its part of speech.
    pos_word = words.map(pos_tag)
    print(pos_word.take(5))


Let's examine the code step by step. First, the script imports PySpark and
creates a SparkContext:

.. code-block:: python

    from pyspark import SparkConf
    from pyspark import SparkContext

    conf = SparkConf()
    conf.setMaster('yarn-client')
    conf.setAppName('spark-nltk')
    sc = SparkContext(conf=conf)

After a SparkContext is created, we can load some data into Spark. In this
case, the data file is one of the example documents included in the NLTK
sample data downloaded earlier.

.. note:: You could also copy the data into HDFS and load it from there; a
   sketch of that variant follows the code block below.

.. code-block:: python

    data = sc.textFile('file:///usr/share/nltk_data/corpora/state_union/1972-Nixon.txt')
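
If you would rather keep the data in HDFS, copy the file there first (for
example with ``hadoop fs -put``) and change only the URL passed to
``textFile``. The HDFS path below is purely illustrative:

.. code-block:: python

    # Assumes the corpus file was copied into HDFS beforehand, for example:
    #   hadoop fs -put /usr/share/nltk_data/corpora/state_union/1972-Nixon.txt /tmp/1972-Nixon.txt
    data = sc.textFile('hdfs:///tmp/1972-Nixon.txt')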

Next is a function called ``word_tokenize`` that imports ``nltk`` on the
Spark worker nodes and calls ``nltk.word_tokenize``. The function is applied
with ``flatMap`` to the text file that was loaded in the previous step:

.. code-block:: python

    def word_tokenize(x):
        import nltk
        return nltk.word_tokenize(x)

    words = data.flatMap(word_tokenize)

You can confirm that the ``flatMap`` operation worked by returning some of the
words in the dataset:

.. code-block:: python

    print(words.take(10))

Finally, you can use NLTK's
`part-of-speech tagger <http://www.nltk.org/book/ch05.html>`_ to attach the part
of speech to each word in the dataset:

.. code-block:: python

    def pos_tag(x):
        import nltk
        return nltk.pos_tag([x])

    pos_word = words.map(pos_tag)
    print(pos_word.take(5))
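
Note that ``pos_tag`` is applied here to one word at a time, so the tagger
cannot use the surrounding words as context. As a sketch of an alternative,
you could tokenize and tag each line of the file in a single step:

.. code-block:: python

    def tokenize_and_tag(line):
        import nltk
        # Tag the tokens of a whole line together so the tagger sees context.
        return nltk.pos_tag(nltk.word_tokenize(line))

    tagged_lines = data.map(tokenize_and_tag)
    print(tagged_lines.take(2))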

Run the script on the Spark cluster using
the `spark-submit <https://spark.apache.org/docs/latest/submitting-applications.html>`_ script.
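
A minimal invocation looks like this, assuming ``spark-submit`` is on your
``PATH`` and you run it from the directory containing ``spark-nltk.py``:

.. code-block:: bash

    $ spark-submit spark-nltk.py
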
The output shows the words that were returned from the Spark script,
including the results from the ``flatMap`` operation and the part-of-speech
tagger:

.. code-block:: text

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    15/06/13 05:14:29 INFO SparkContext: Running Spark version 1.4.0

    [...]

    ['Address',
     'on',
     'the',
     'State',
     'of',
     'the',
     'Union',
     'Delivered',
     'Before',
     'a']

    [...]

    [[('Address', 'NN')],
     [('on', 'IN')],
     [('the', 'DT')],
     [('State', 'NNP')],
     [('of', 'IN')]]


.. _`scale-spark-nltk-troubleshooting`:

Troubleshooting
===============

If something goes wrong, consult :doc:`../help-support`.


.. _`scale-spark-nltk-further-info`:

Further information
===================

See the Spark_ and PySpark_ documentation:

.. _Spark: https://spark.apache.org/
.. _PySpark: https://spark.apache.org/docs/latest/programming-guide.html

For more information on NLTK, see the `NLTK book <http://www.nltk.org/book/>`_.
