Create custom Jupyter kernel for PySpark (AEN 4.2.0)
====================================================

.. raw:: html

    <p>These instructions add a custom Jupyter Notebook option to allow users to select PySpark as the kernel.</p>
    <div class="section" id="install-spark">
    <h2>Install Spark<a class="headerlink" href="#install-spark" title="Permalink to this headline">¶</a></h2>
    <p>The easiest way to install Spark is with <a class="reference external" href="https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_cdh5_install.html#topic_4_4">Cloudera CDH</a>.</p>
    <p>You will use YARN as a resource manager. After installing Cloudera CDH, <a class="reference external" href="https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_spark_installation.html">install Spark</a>. Spark comes with a PySpark shell.</p>
    </div>
    <div class="section" id="create-a-notebook-kernel-for-pyspark">
    <h2>Create a notebook kernel for PySpark<a class="headerlink" href="#create-a-notebook-kernel-for-pyspark" title="Permalink to this headline">¶</a></h2>
    <p>You may create the kernel as an administrator or as a regular user. Read the instructions below to help you choose which method to use.</p>
    <div class="section" id="as-an-administrator">
    <h3>1. As an administrator<a class="headerlink" href="#as-an-administrator" title="Permalink to this headline">¶</a></h3>
    <p>Create a new kernel and point it to the root environment in each project. To do so, create a &#8216;pyspark&#8217; directory in <cite>/opt/wakari/wakari-compute/share/jupyter/kernels/</cite>.</p>
    <p>Create the following kernel.json file:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span><span class="p">{</span><span class="s2">&quot;argv&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;/opt/wakari/anaconda/bin/python&quot;</span><span class="p">,</span>
     <span class="s2">&quot;-m&quot;</span><span class="p">,</span> <span class="s2">&quot;ipykernel&quot;</span><span class="p">,</span> <span class="s2">&quot;-f&quot;</span><span class="p">,</span> <span class="s2">&quot;{connection_file}&quot;</span><span class="p">,</span> <span class="s2">&quot;--profile&quot;</span><span class="p">,</span> <span class="s2">&quot;pyspark&quot;</span><span class="p">],</span>
     <span class="s2">&quot;display_name&quot;</span><span class="p">:</span><span class="s2">&quot;PySpark&quot;</span><span class="p">,</span>  <span class="s2">&quot;language&quot;</span><span class="p">:</span><span class="s2">&quot;python&quot;</span> <span class="p">}</span>
    </pre></div>
    </div>
    <p>You may choose any name for the &#8216;display_name&#8217;.</p>
    <p>This configuration points to the Python executable in the root environment. Because that environment is under admin control, users cannot add new packages to it; they will need an admin to update the environment for them.</p>
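As a sketch, the kernel spec above can also be generated with Python's standard json module, which guarantees valid JSON and keeps the literal <cite>{connection_file}</cite> placeholder intact (the paths are the AEN defaults used above):

```python
import json

# Build the kernel spec shown above. Jupyter substitutes the literal
# "{connection_file}" placeholder with the real connection file at launch.
kernel_spec = {
    "argv": [
        "/opt/wakari/anaconda/bin/python",
        "-m", "ipykernel",
        "-f", "{connection_file}",
        "--profile", "pyspark",
    ],
    "display_name": "PySpark",
    "language": "python",
}

kernel_json = json.dumps(kernel_spec, indent=1)
print(kernel_json)
```

Write the output to <cite>/opt/wakari/wakari-compute/share/jupyter/kernels/pyspark/kernel.json</cite>.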
    </div>
    <div class="section" id="as-an-administrator-without-ipython-profile">
    <h3>2. As an administrator without IPython profile<a class="headerlink" href="#as-an-administrator-without-ipython-profile" title="Permalink to this headline">¶</a></h3>
    <p>To provide an admin-level PySpark kernel that does not rely on each user&#8217;s .ipython space, use this kernel.json instead:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span><span class="p">{</span><span class="s2">&quot;argv&quot;</span><span class="p">:</span>
    <span class="p">[</span><span class="s2">&quot;/opt/wakari/wakari-compute/etc/ipython/pyspark.sh&quot;</span><span class="p">,</span> <span class="s2">&quot;-f&quot;</span><span class="p">,</span> <span class="s2">&quot;</span><span class="si">{connection_file}</span><span class="s2">&quot;</span><span class="p">],</span>
    <span class="s2">&quot;display_name&quot;</span><span class="p">:</span><span class="s2">&quot;PySpark&quot;</span><span class="p">,</span>  <span class="s2">&quot;language&quot;</span><span class="p">:</span><span class="s2">&quot;python&quot;</span> <span class="p">}</span>
    </pre></div>
    </div>
    <p>NOTE: The pyspark.sh script is defined in <a class="reference internal" href="#aen-custom-pyspark-kernel-wo-ipython-profile"><span class="std std-ref">Without IPython profile</span></a> section below.</p>
    </div>
    <div class="section" id="as-a-regular-user">
    <h3>3. As a regular user<a class="headerlink" href="#as-a-regular-user" title="Permalink to this headline">¶</a></h3>
    <p>Create a new directory in the user&#8217;s home directory: <cite>.local/share/jupyter/kernels/pyspark/</cite>. This way the user uses the project&#8217;s default environment and can upgrade or install new packages.</p>
    <p>Create the following kernel.json file:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span><span class="p">{</span><span class="s2">&quot;argv&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;/projects/&lt;username&gt;/&lt;project_name&gt;/envs/default/bin/python&quot;</span><span class="p">,</span>
     <span class="s2">&quot;-m&quot;</span><span class="p">,</span> <span class="s2">&quot;ipykernel&quot;</span><span class="p">,</span> <span class="s2">&quot;-f&quot;</span><span class="p">,</span> <span class="s2">&quot;{connection_file}&quot;</span><span class="p">,</span> <span class="s2">&quot;--profile&quot;</span><span class="p">,</span> <span class="s2">&quot;pyspark&quot;</span><span class="p">],</span>
     <span class="s2">&quot;display_name&quot;</span><span class="p">:</span><span class="s2">&quot;PySpark&quot;</span><span class="p">,</span>  <span class="s2">&quot;language&quot;</span><span class="p">:</span><span class="s2">&quot;python&quot;</span> <span class="p">}</span>
    </pre></div>
    </div>
    <p>NOTE: Replace &#8220;&lt;username&gt;&#8221; with the correct user name and &#8220;&lt;project_name&gt;&#8221; with the correct project name.</p>
    <p>You may choose any name for the &#8216;display_name&#8217;.</p>
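As a sketch, the per-user kernel.json can be generated the same way; the <cite>username</cite> and <cite>project_name</cite> values here are placeholders you must substitute with real ones:

```python
import json

def user_kernel_spec(username, project_name):
    """Build a per-user kernel spec pointing at the project's default env."""
    python_path = "/projects/{}/{}/envs/default/bin/python".format(
        username, project_name)
    return {
        "argv": [python_path, "-m", "ipykernel",
                 "-f", "{connection_file}", "--profile", "pyspark"],
        "display_name": "PySpark",
        "language": "python",
    }

# "alice" and "demo" are example placeholder values, not real AEN names.
spec = user_kernel_spec("alice", "demo")
print(json.dumps(spec, indent=1))
```

Save the output as <cite>~/.local/share/jupyter/kernels/pyspark/kernel.json</cite> in the user's home directory.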
    </div>
    </div>
    <div class="section" id="create-an-ipython-profile">
    <h2>Create an IPython profile<a class="headerlink" href="#create-an-ipython-profile" title="Permalink to this headline">¶</a></h2>
    <p>The <cite>--profile pyspark</cite> argument in the kernel configuration above requires a matching IPython profile. This profile must be created for each user who logs in to AEN to use the PySpark kernel.</p>
    <p>In the user&#8217;s home, create the directory and file <code class="docutils literal"><span class="pre">~/.ipython/profile_pyspark/startup/00-pyspark-setup.py</span></code> with the file contents:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>
    <span class="kn">import</span> <span class="nn">sys</span>

    <span class="c1"># The directory where CDH installed Spark; change it if Spark was installed elsewhere.</span>
    <span class="c1"># Optionally, this value could be read from the environment instead of hardcoded.</span>

    <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;SPARK_HOME&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;/usr/lib/spark&quot;</span>

    <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYSPARK_PYTHON&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;/opt/wakari/anaconda/bin/python&quot;</span>

    <span class="c1"># And Python path</span>
    <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYLIB&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;SPARK_HOME&quot;</span><span class="p">]</span> <span class="o">+</span> <span class="s2">&quot;/python/lib&quot;</span>
    <span class="n">sys</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYLIB&quot;</span><span class="p">]</span> <span class="o">+</span> <span class="s2">&quot;/py4j-0.9-src.zip&quot;</span><span class="p">)</span>  <span class="c1"># use py4j-0.10.4-src.zip for newer Spark versions</span>
    <span class="n">sys</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYLIB&quot;</span><span class="p">]</span> <span class="o">+</span><span class="s2">&quot;/pyspark.zip&quot;</span><span class="p">)</span>

    <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYSPARK_SUBMIT_ARGS&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;--master yarn pyspark-shell&quot;</span>
    </pre></div>
    </div>
    <p>Now log in using the user account that has the PySpark profile.</p>
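The environment that the startup file establishes can be sketched as a plain function, which makes the path construction easy to verify; the py4j zip name is an assumption that must match your Spark install, and the submit arguments assume YARN as the resource manager:

```python
def pyspark_env(spark_home="/usr/lib/spark", py4j_zip="py4j-0.9-src.zip"):
    """Return the env vars and sys.path entries that 00-pyspark-setup.py sets.

    spark_home and py4j_zip are assumptions; adjust them to your install.
    """
    pylib = spark_home + "/python/lib"
    env = {
        "SPARK_HOME": spark_home,
        "PYSPARK_PYTHON": "/opt/wakari/anaconda/bin/python",
        "PYLIB": pylib,
        # Use YARN as the resource manager, per the Spark install above.
        "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell",
    }
    # These zips must be prepended to sys.path so pyspark can be imported.
    path_entries = [pylib + "/" + py4j_zip, pylib + "/pyspark.zip"]
    return env, path_entries

env, entries = pyspark_env()
```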
    <div class="section" id="without-ipython-profile">
    <span id="aen-custom-pyspark-kernel-wo-ipython-profile"></span><h3>Without IPython profile<a class="headerlink" href="#without-ipython-profile" title="Permalink to this headline">¶</a></h3>
    <p>To avoid creating a local IPython profile for each user, the kernel can
    instead call a script that loads the environment variables. Create an
    executable bash script as follows:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span>sudo -u $AEN_SRVC_ACCT mkdir /opt/wakari/wakari-compute/etc/ipython
    sudo -u $AEN_SRVC_ACCT touch /opt/wakari/wakari-compute/etc/ipython/pyspark.sh
    sudo -u $AEN_SRVC_ACCT chmod a+x /opt/wakari/wakari-compute/etc/ipython/pyspark.sh
    </pre></div>
    </div>
    <p>The contents of the file should look like:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span>#!/usr/bin/env bash
    # set up environment variables, etc.

    export PYSPARK_PYTHON=&quot;/opt/wakari/anaconda/bin/python&quot;
    export SPARK_HOME=&quot;/usr/lib/spark&quot;

    # And Python path
    export PYLIB=$SPARK_HOME/python/lib
    export PYTHONPATH=$PYTHONPATH:$PYLIB/py4j-0.9-src.zip
    export PYTHONPATH=$PYTHONPATH:$PYLIB/pyspark.zip

    export PYSPARK_SUBMIT_ARGS=&quot;--master yarn pyspark-shell&quot;

    # run the ipykernel
    exec /opt/wakari/anaconda/bin/python -m ipykernel &quot;$@&quot;
    </pre></div>
    </div>
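Before wiring the script into kernel.json, a quick sanity check (a sketch; the path is the AEN location used above) confirms it exists and is executable:

```python
import os

def is_executable(path):
    """True if path is a regular file the current user can execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)

# On an AEN host this should print True once the steps above are done.
print(is_executable("/opt/wakari/wakari-compute/etc/ipython/pyspark.sh"))
```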
    </div>
    </div>
    <div class="section" id="using-pyspark">
    <h2>Using PySpark<a class="headerlink" href="#using-pyspark" title="Permalink to this headline">¶</a></h2>
    <p>When you create a new notebook in a project, PySpark now appears as a kernel option. In a notebook using that kernel you can import pyspark and start working with it:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark</span> <span class="k">import</span> <span class="n">SparkConf</span>
    <span class="kn">from</span> <span class="nn">pyspark</span> <span class="k">import</span> <span class="n">SparkContext</span>
    </pre></div>
    </div>
    <p>NOTE: You can add these imports, and any other commands you use frequently, to the PySpark setup file <code class="docutils literal"><span class="pre">00-pyspark-setup.py</span></code> shown above.</p>
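A minimal end-to-end smoke test, assuming the PySpark kernel is active (the import only succeeds when the setup above put pyspark on the path, and creating a context requires a working Spark install):

```python
try:
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("aen-pyspark-smoke-test")
    sc = SparkContext(conf=conf)
    # Sum 0..9 across the cluster as a trivial end-to-end check.
    total = sc.parallelize(range(10)).sum()
    sc.stop()
except Exception:
    # pyspark is missing or Spark is not reachable; revisit the setup above.
    total = None
```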
    </div>
