Create a custom Jupyter kernel for PySpark (AEN 4.1.3)
======================================================

.. raw:: html

    <p>These instructions add a custom Jupyter Notebook option to allow users to select PySpark as the kernel.</p>
    <div class="section" id="install-spark">
    <h2>Install Spark<a class="headerlink" href="#install-spark" title="Permalink to this headline">¶</a></h2>
    <p>The easiest way to install Spark is with <a class="reference external" href="https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_cdh5_install.html#topic_4_4">Cloudera CDH</a>.</p>
    <p>You will use YARN as a resource manager. After installing Cloudera CDH, <a class="reference external" href="https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_spark_installation.html">install Spark</a>. Spark comes with a PySpark shell.</p>
    </div>
    <div class="section" id="create-a-notebook-kernel-for-pyspark">
    <h2>Create a notebook kernel for PySpark<a class="headerlink" href="#create-a-notebook-kernel-for-pyspark" title="Permalink to this headline">¶</a></h2>
    <p>You may create the kernel as an administrator or as a regular user. Read the instructions below to help you choose which method to use.</p>
    <div class="section" id="as-an-administrator">
    <h3>1. As an administrator<a class="headerlink" href="#as-an-administrator" title="Permalink to this headline">¶</a></h3>
    <p>Create a new kernel and point it to the root environment in each project. To do so, create a directory named &#8216;pyspark&#8217; in <cite>/opt/wakari/wakari-compute/share/jupyter/kernels/</cite>.</p>
    <p>Create the following kernel.json file:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span><span class="p">{</span><span class="s2">&quot;argv&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;/opt/wakari/anaconda/bin/python&quot;</span><span class="p">,</span>
     <span class="s2">&quot;-m&quot;</span><span class="p">,</span> <span class="s2">&quot;ipykernel&quot;</span><span class="p">,</span> <span class="s2">&quot;-f&quot;</span><span class="p">,</span> <span class="s2">&quot;connection_file}&quot;</span><span class="p">,</span> <span class="s2">&quot;--profile&quot;</span><span class="p">,</span> <span class="s2">&quot;pyspark&quot;</span><span class="p">],</span>
     <span class="s2">&quot;display_name&quot;</span><span class="p">:</span><span class="s2">&quot;PySpark&quot;</span><span class="p">,</span>  <span class="s2">&quot;language&quot;</span><span class="p">:</span><span class="s2">&quot;python&quot;</span> <span class="p">}</span>
    </pre></div>
    </div>
    <p>You may choose any name for the &#8216;display_name&#8217;.</p>
    <p>This configuration points to the Python executable in the root environment. Because that environment is under administrator control, users cannot add new packages to it; they will need an administrator to update the environment.</p>
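    <p>As a quick sanity check, the kernel definition can be validated with Python&#8217;s json module. This is a minimal sketch; the inline string stands in for the contents of the kernel.json file created above:</p>

    ```python
    import json

    # Stand-in for /opt/wakari/wakari-compute/share/jupyter/kernels/pyspark/kernel.json
    kernel_json = """
    {"argv": ["/opt/wakari/anaconda/bin/python",
     "-m", "ipykernel", "-f", "{connection_file}", "--profile", "pyspark"],
     "display_name": "PySpark", "language": "python"}
    """

    spec = json.loads(kernel_json)
    assert spec["display_name"] == "PySpark"
    # Jupyter substitutes this placeholder with the real connection file path.
    assert "{connection_file}" in spec["argv"]
    ```

    <p>If json.loads raises an error, the kernel.json is malformed and Jupyter will not list the kernel.</p>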
    </div>
    <div class="section" id="as-a-regular-user">
    <h3>2. As a regular user<a class="headerlink" href="#as-a-regular-user" title="Permalink to this headline">¶</a></h3>
    <p>Create a new directory in the user&#8217;s home directory: <cite>.local/share/jupyter/kernels/pyspark/</cite>. This way the user runs the project&#8217;s default environment and can upgrade or install new packages.</p>
    <p>Create the following kernel.json file:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span><span class="p">{</span><span class="s2">&quot;argv&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;/projects/&lt;username&gt;/&lt;project_name&gt;/envs/default/bin/python&quot;</span><span class="p">,</span>
     <span class="s2">&quot;-m&quot;</span><span class="p">,</span> <span class="s2">&quot;ipykernel&quot;</span><span class="p">,</span> <span class="s2">&quot;-f&quot;</span><span class="p">,</span> <span class="s2">&quot;connection_file}&quot;</span><span class="p">,</span> <span class="s2">&quot;--profile&quot;</span><span class="p">,</span> <span class="s2">&quot;pyspark&quot;</span><span class="p">],</span>
     <span class="s2">&quot;display_name&quot;</span><span class="p">:</span><span class="s2">&quot;PySpark&quot;</span><span class="p">,</span>  <span class="s2">&quot;language&quot;</span><span class="p">:</span><span class="s2">&quot;python&quot;</span> <span class="p">}</span>
    </pre></div>
    </div>
    <p>NOTE: Replace &#8220;&lt;username&gt;&#8221; with the correct user name and &#8220;&lt;project_name&gt;&#8221; with the correct project name.</p>
    <p>You may choose any name for the &#8216;display_name&#8217;.</p>
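    <p>The per-user kernel.json can also be generated from a small script. This is a sketch under stated assumptions: the user and project names are hypothetical placeholders, and the file is written to a temporary directory here; in AEN the target would be <cite>~/.local/share/jupyter/kernels/pyspark/</cite>:</p>

    ```python
    import json
    import os
    import tempfile

    # Hypothetical placeholders; substitute the real user and project names.
    username = "jsmith"
    project_name = "spark_demo"

    spec = {
        "argv": [f"/projects/{username}/{project_name}/envs/default/bin/python",
                 "-m", "ipykernel", "-f", "{connection_file}",
                 "--profile", "pyspark"],
        "display_name": "PySpark",
        "language": "python",
    }

    # A temporary directory is used so the sketch runs anywhere; in AEN this
    # would be os.path.expanduser("~/.local/share/jupyter/kernels/pyspark").
    kernel_dir = os.path.join(tempfile.mkdtemp(), "kernels", "pyspark")
    os.makedirs(kernel_dir)
    with open(os.path.join(kernel_dir, "kernel.json"), "w") as f:
        json.dump(spec, f, indent=1)
    ```

    <p>Writing the file with json.dump guarantees valid JSON, avoiding quoting mistakes that are easy to make by hand.</p>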
    </div>
    </div>
    <div class="section" id="create-an-ipython-profile">
    <h2>Create an IPython profile<a class="headerlink" href="#create-an-ipython-profile" title="Permalink to this headline">¶</a></h2>
    <p>The <cite>--profile pyspark</cite> argument in the kernel definitions above requires a matching IPython profile. Create this profile for each user who logs in to AEN to use the PySpark kernel.</p>
    <p>In the user&#8217;s home, create the directory and file <code class="docutils literal"><span class="pre">~/.ipython/profile_pyspark/startup/00-pyspark-setup.py</span></code> with the file contents:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>
    <span class="kn">import</span> <span class="nn">sys</span>

    <span class="c1"># The location where CDH installed Spark; change this if Spark was installed elsewhere.</span>
    <span class="c1"># Optionally, this value could be read from the environment instead of hard-coded.</span>

    <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;SPARK_HOME&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;/usr/lib/spark&quot;</span>

    <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYSPARK_PYTHON&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;/opt/wakari/anaconda/bin/python&quot;</span>

    <span class="c1"># And Python path</span>
    <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYLIB&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;SPARK_HOME&quot;</span><span class="p">]</span> <span class="o">+</span> <span class="s2">&quot;/python/lib&quot;</span>
    <span class="n">sys</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYLIB&quot;</span><span class="p">]</span> <span class="o">+</span><span class="s2">&quot;/py4j-0.9-src.zip&quot;</span><span class="p">)</span>  <span class="c1"># match the py4j version shipped with your Spark, e.g. py4j-0.10.4-src.zip</span>
    <span class="n">sys</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYLIB&quot;</span><span class="p">]</span> <span class="o">+</span><span class="s2">&quot;/pyspark.zip&quot;</span><span class="p">)</span>

    <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">&quot;PYSPARK_SUBMIT_ARGS&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;--master yarn pyspark-shell&quot;</span>
    </pre></div>
    </div>
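    <p>As the comment in the script notes, SPARK_HOME does not have to be hard-coded. A minimal sketch of the environment-variable fallback, using the CDH default path as the fallback value:</p>

    ```python
    import os

    # Use an already-set SPARK_HOME if present; otherwise fall back to the
    # CDH default install location.
    spark_home = os.environ.get("SPARK_HOME", "/usr/lib/spark")
    pylib = os.path.join(spark_home, "python", "lib")
    ```

    <p>With this pattern, a user who installed Spark somewhere else can simply export SPARK_HOME before starting the notebook server, without editing the startup file.</p>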
    <p>Now log in using the user account that has the PySpark profile.</p>
    <p>When you create a new notebook in a project, PySpark now appears in the list of available kernels. In a PySpark notebook you can import pyspark and start using it:</p>
    <div class="highlight-default"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark</span> <span class="k">import</span> <span class="n">SparkConf</span>
    <span class="kn">from</span> <span class="nn">pyspark</span> <span class="k">import</span> <span class="n">SparkContext</span>
    </pre></div>
    </div>
    <p>NOTE: You can add these imports, and any other commands you use frequently, to the PySpark setup file <code class="docutils literal"><span class="pre">00-pyspark-setup.py</span></code> shown above, so they run automatically whenever the kernel starts.</p>
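    <p>Appending such lines to the startup file can itself be scripted. A sketch under stated assumptions: a temporary directory stands in for the real location, <cite>~/.ipython/profile_pyspark/startup/00-pyspark-setup.py</cite>:</p>

    ```python
    import os
    import tempfile

    # Stand-in for ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
    startup_dir = tempfile.mkdtemp()
    setup_file = os.path.join(startup_dir, "00-pyspark-setup.py")

    frequent_lines = [
        "from pyspark import SparkConf",
        "from pyspark import SparkContext",
    ]

    # Append the imports so every PySpark notebook starts with them available.
    with open(setup_file, "a") as f:
        f.write("\n".join(frequent_lines) + "\n")
    ```

    <p>Files in a profile&#8217;s startup directory are executed in lexicographic order each time the kernel starts, so lines appended here take effect on the next notebook launch.</p>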
    </div>
