GPU Reduction
==============

.. raw:: html

    <p>Writing a reduction algorithm for a CUDA GPU can be tricky.
    NumbaPro provides a <code class="docutils literal"><span class="pre">&#64;reduce</span></code> decorator that converts a simple binary operation into a reduction kernel.</p>
    <div class="section" id="reduce">
    <h2><code class="docutils literal"><span class="pre">&#64;reduce</span></code><a class="headerlink" href="#reduce" title="Permalink to this headline">¶</a></h2>
    <p>Example:</p>
    <div class="highlight-python"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span>
    <span class="kn">from</span> <span class="nn">numbapro</span> <span class="kn">import</span> <span class="n">cuda</span>

    <span class="nd">@cuda.reduce</span>
    <span class="k">def</span> <span class="nf">sum_reduce</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>

    <span class="n">A</span> <span class="o">=</span> <span class="p">(</span><span class="n">numpy</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1234</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">numpy</span><span class="o">.</span><span class="n">float64</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="n">expect</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>      <span class="c1"># numpy sum reduction</span>
    <span class="n">got</span> <span class="o">=</span> <span class="n">sum_reduce</span><span class="p">(</span><span class="n">A</span><span class="p">)</span>   <span class="c1"># cuda sum reduction</span>
    <span class="k">assert</span> <span class="n">expect</span> <span class="o">==</span> <span class="n">got</span>
    </pre></div>
    </div>
    <p>A lambda function can also be used:</p>
    <div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">sum_reduce</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span>
    </pre></div>
    </div>
    <p>The decorated function <strong>must not use CUDA-specific features</strong> because it is also executed on the host for the final round of the reduction.</p>
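    <p>To illustrate why this restriction matters, the host-only sketch below mimics a two-stage reduction in plain Python. It is an illustration under stated assumptions, not NumbaPro's actual implementation: <code class="docutils literal"><span class="pre">sum_op</span></code>, <code class="docutils literal"><span class="pre">two_stage_reduce</span></code> and <code class="docutils literal"><span class="pre">block_size</span></code> are hypothetical names, and the first stage merely stands in for the work a GPU kernel would do.</p>
    <div class="highlight-python"><div class="highlight"><pre><span></span>import functools

    def sum_op(a, b):
        # Plain Python only: no CUDA intrinsics, so it can also run on the host.
        return a + b

    def two_stage_reduce(op, values, block_size=256):
        # Stage 1 (stands in for the GPU kernel): fold each block to one partial result.
        partials = [functools.reduce(op, values[i:i + block_size])
                    for i in range(0, len(values), block_size)]
        # Stage 2 (host side): fold the partial results with the same function.
        return functools.reduce(op, partials)

    A = [float(i) for i in range(1, 1235)]
    assert two_stage_reduce(sum_op, A) == sum(A)
    </pre></div>
    </div>
    <p>In this sketch the second stage calls the same function as ordinary Python code, so a function that relied on CUDA-only features would fail at that point.</p>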
    </div>
    <div class="section" id="class-reduce">
    <h2>class Reduce<a class="headerlink" href="#class-reduce" title="Permalink to this headline">¶</a></h2>
    <p>The <code class="docutils literal"><span class="pre">reduce</span></code> decorator creates an instance of the <code class="docutils literal"><span class="pre">Reduce</span></code> class.  (Currently, <code class="docutils literal"><span class="pre">reduce</span></code> is an alias of <code class="docutils literal"><span class="pre">Reduce</span></code>, but this behavior is not guaranteed.)</p>
