CUDA Sorting
============

.. raw:: html

    <p>Accelerate provides routines for sorting arrays on CUDA GPUs.</p>
    <div class="section" id="sorting-large-arrays">
    <h2>Sorting Large Arrays<a class="headerlink" href="#sorting-large-arrays" title="Permalink to this headline">¶</a></h2>
    <p>The <a class="reference internal" href="#accelerate.cuda.sorting.RadixSort" title="accelerate.cuda.sorting.RadixSort"><code class="xref py py-class docutils literal"><span class="pre">accelerate.cuda.sorting.RadixSort</span></code></a> class is recommended for
    sorting large arrays of numeric types (roughly more than 1 million items).</p>
    <dl class="class">
    <dt id="accelerate.cuda.sorting.RadixSort">
    <em class="property">class </em><code class="descclassname">accelerate.cuda.sorting.</code><code class="descname">RadixSort</code><span class="sig-paren">(</span><em>maxcount</em>, <em>dtype</em>, <em>descending=False</em>, <em>stream=0</em><span class="sig-paren">)</span><a class="headerlink" href="#accelerate.cuda.sorting.RadixSort" title="Permalink to this definition">¶</a></dt>
    <dd><p>Provides radix sort and radix select.</p>
    <p>The algorithm implemented here is best for large arrays (<code class="docutils literal"><span class="pre">N</span> <span class="pre">&gt;</span> <span class="pre">1e6</span></code>) due to
    the latency introduced by its use of multiple kernel launches. It is
    recommended to use <code class="docutils literal"><span class="pre">segmented_sort</span></code> instead for batches of smaller arrays.</p>
    <table class="docutils field-list" frame="void" rules="none">
    <col class="field-name" />
    <col class="field-body" />
    <tbody valign="top">
    <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
    <li><strong>maxcount</strong> (<em>int</em>) &#8211; Maximum number of items to sort</li>
    <li><strong>dtype</strong> (<em>numpy.dtype</em>) &#8211; The element type to sort</li>
    <li><strong>descending</strong> (<em>bool</em>) &#8211; Whether to sort in descending order</li>
    <li><strong>stream</strong> &#8211; The CUDA stream to run the kernels in</li>
    </ul>
    </td>
    </tr>
    </tbody>
    </table>
    <dl class="method">
    <dt id="accelerate.cuda.sorting.RadixSort.argselect">
    <code class="descname">argselect</code><span class="sig-paren">(</span><em>k</em>, <em>keys</em>, <em>begin_bit=0</em>, <em>end_bit=None</em><span class="sig-paren">)</span><a class="headerlink" href="#accelerate.cuda.sorting.RadixSort.argselect" title="Permalink to this definition">¶</a></dt>
    <dd><p>Similar to <code class="docutils literal"><span class="pre">RadixSort.select</span></code> but returns the new sorted indices.</p>
    <table class="docutils field-list" frame="void" rules="none">
    <col class="field-name" />
    <col class="field-body" />
    <tbody valign="top">
    <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
    <li><strong>keys</strong> (<em>numpy.ndarray</em>) &#8211; Keys to sort in place</li>
    <li><strong>begin_bit</strong> (<em>int</em>) &#8211; The first bit to sort</li>
    <li><strong>end_bit</strong> (<em>int</em>) &#8211; Optional. The last bit to sort</li>
    </ul>
    </td>
    </tr>
    <tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">The indices indicating the new ordering as an array on the CUDA
    device or on the host.</p>
    </td>
    </tr>
    </tbody>
    </table>
    </dd></dl>

    <dl class="method">
    <dt id="accelerate.cuda.sorting.RadixSort.argsort">
    <code class="descname">argsort</code><span class="sig-paren">(</span><em>keys</em>, <em>begin_bit=0</em>, <em>end_bit=None</em><span class="sig-paren">)</span><a class="headerlink" href="#accelerate.cuda.sorting.RadixSort.argsort" title="Permalink to this definition">¶</a></dt>
    <dd><p>Similar to <code class="docutils literal"><span class="pre">RadixSort.sort</span></code> but returns the new sorted indices.</p>
    <table class="docutils field-list" frame="void" rules="none">
    <col class="field-name" />
    <col class="field-body" />
    <tbody valign="top">
    <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
    <li><strong>keys</strong> (<em>numpy.ndarray</em>) &#8211; Keys to sort in place</li>
    <li><strong>begin_bit</strong> (<em>int</em>) &#8211; The first bit to sort</li>
    <li><strong>end_bit</strong> (<em>int</em>) &#8211; Optional. The last bit to sort</li>
    </ul>
    </td>
    </tr>
    <tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">The indices indicating the new ordering as an array on the CUDA
    device or on the host.</p>
    </td>
    </tr>
    </tbody>
    </table>
    </dd></dl>

    <dl class="method">
    <dt id="accelerate.cuda.sorting.RadixSort.close">
    <code class="descname">close</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#accelerate.cuda.sorting.RadixSort.close" title="Permalink to this definition">¶</a></dt>
    <dd><p>Explicitly release internal resources.</p>
    <p>Called automatically when the object is deleted.</p>
    </dd></dl>

    <dl class="method">
    <dt id="accelerate.cuda.sorting.RadixSort.init_arg">
    <code class="descname">init_arg</code><span class="sig-paren">(</span><em>size</em><span class="sig-paren">)</span><a class="headerlink" href="#accelerate.cuda.sorting.RadixSort.init_arg" title="Permalink to this definition">¶</a></dt>
    <dd><p>Create a CUDA ndarray of uint32 filled with ascending integers
    starting from zero.</p>
    <table class="docutils field-list" frame="void" rules="none">
    <col class="field-name" />
    <col class="field-body" />
    <tbody valign="top">
    <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>size</strong> (<em>int</em>) &#8211; Number of elements for the output array</td>
    </tr>
    <tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body">An array with values <code class="docutils literal"><span class="pre">[0,</span> <span class="pre">1,</span> <span class="pre">2,</span> <span class="pre">...,</span> <span class="pre">size</span> <span class="pre">-</span> <span class="pre">1]</span></code></td>
    </tr>
    </tbody>
    </table>
    </dd></dl>

    <dl class="method">
    <dt id="accelerate.cuda.sorting.RadixSort.select">
    <code class="descname">select</code><span class="sig-paren">(</span><em>k</em>, <em>keys</em>, <em>vals=None</em>, <em>begin_bit=0</em>, <em>end_bit=None</em><span class="sig-paren">)</span><a class="headerlink" href="#accelerate.cuda.sorting.RadixSort.select" title="Permalink to this definition">¶</a></dt>
    <dd><p>Perform an in-place k-select on <code class="docutils literal"><span class="pre">keys</span></code>.</p>
    <p>Memory transfer is performed automatically.</p>
    <table class="docutils field-list" frame="void" rules="none">
    <col class="field-name" />
    <col class="field-body" />
    <tbody valign="top">
    <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
    <li><strong>keys</strong> (<em>numpy.ndarray</em>) &#8211; Keys to sort in place</li>
    <li><strong>vals</strong> (<em>numpy.ndarray</em>) &#8211; Optional. Additional values to be reordered along with the keys.
    It is modified in place. Only the <code class="docutils literal"><span class="pre">uint32</span></code> dtype is
    supported in this version.</li>
    <li><strong>begin_bit</strong> (<em>int</em>) &#8211; The first bit to sort</li>
    <li><strong>end_bit</strong> (<em>int</em>) &#8211; Optional. The last bit to sort</li>
    </ul>
    </td>
    </tr>
    </tbody>
    </table>
    </dd></dl>

    <dl class="method">
    <dt id="accelerate.cuda.sorting.RadixSort.sort">
    <code class="descname">sort</code><span class="sig-paren">(</span><em>keys</em>, <em>vals=None</em>, <em>begin_bit=0</em>, <em>end_bit=None</em><span class="sig-paren">)</span><a class="headerlink" href="#accelerate.cuda.sorting.RadixSort.sort" title="Permalink to this definition">¶</a></dt>
    <dd><p>Perform an in-place sort on <code class="docutils literal"><span class="pre">keys</span></code>. Memory transfer is performed
    automatically.</p>
    <table class="docutils field-list" frame="void" rules="none">
    <col class="field-name" />
    <col class="field-body" />
    <tbody valign="top">
    <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
    <li><strong>keys</strong> (<em>numpy.ndarray</em>) &#8211; Keys to sort in place</li>
    <li><strong>vals</strong> (<em>numpy.ndarray</em>) &#8211; Optional. Additional values to be reordered along with the keys.
    It is modified in place. Only the <code class="docutils literal"><span class="pre">uint32</span></code> dtype is
    supported in this version.</li>
    <li><strong>begin_bit</strong> (<em>int</em>) &#8211; The first bit to sort</li>
    <li><strong>end_bit</strong> (<em>int</em>) &#8211; Optional. The last bit to sort</li>
    </ul>
    </td>
    </tr>
    </tbody>
    </table>
    </dd></dl>

    </dd></dl>

    </div>
    <div class="section" id="sorting-many-small-arrays">
    <h2>Sorting Many Small Arrays<a class="headerlink" href="#sorting-many-small-arrays" title="Permalink to this headline">¶</a></h2>
    <p>Using <a class="reference internal" href="#accelerate.cuda.sorting.RadixSort" title="accelerate.cuda.sorting.RadixSort"><code class="xref py py-class docutils literal"><span class="pre">accelerate.cuda.sorting.RadixSort</span></code></a> on small arrays (roughly fewer than
    1 million items) has significant overhead due to multiple kernel
    launches. A better alternative is
    <a class="reference internal" href="#accelerate.cuda.sorting.segmented_sort" title="accelerate.cuda.sorting.segmented_sort"><code class="xref py py-func docutils literal"><span class="pre">accelerate.cuda.sorting.segmented_sort()</span></code></a>, which launches a single kernel
    to sort a batch of many small arrays.</p>
    <dl class="function">
    <dt id="accelerate.cuda.sorting.segmented_sort">
    <code class="descclassname">accelerate.cuda.sorting.</code><code class="descname">segmented_sort</code><span class="sig-paren">(</span><em>keys</em>, <em>vals</em>, <em>segments</em>, <em>stream=0</em><span class="sig-paren">)</span><a class="headerlink" href="#accelerate.cuda.sorting.segmented_sort" title="Permalink to this definition">¶</a></dt>
    <dd><p>Performs an in-place sort on small segments (N &lt; 1e6).</p>
    <table class="docutils field-list" frame="void" rules="none">
    <col class="field-name" />
    <col class="field-body" />
    <tbody valign="top">
    <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
    <li><strong>keys</strong> (<em>numpy.ndarray</em>) &#8211; Keys to sort in place.</li>
    <li><strong>vals</strong> (<em>numpy.ndarray</em>) &#8211; Values to be reordered in place along with the keys. Only the
    <code class="docutils literal"><span class="pre">uint32</span></code> dtype is supported in this implementation.</li>
    <li><strong>segments</strong> (<em>numpy.ndarray</em>) &#8211; Segment separation locations, e.g. <code class="docutils literal"><span class="pre">array([3,</span> <span class="pre">6,</span> <span class="pre">8])</span></code> for
    segments of <code class="docutils literal"><span class="pre">keys[:3]</span></code>, <code class="docutils literal"><span class="pre">keys[3:6]</span></code>, <code class="docutils literal"><span class="pre">keys[6:8]</span></code>, and
    <code class="docutils literal"><span class="pre">keys[8:]</span></code>.</li>
    <li><strong>stream</strong> &#8211; Optional. A CUDA stream in which the kernels are executed.</li>
    </ul>
    </td>
    </tr>
    </tbody>
    </table>
    </dd></dl>

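The ``segments`` convention described for ``segmented_sort`` can be illustrated with a small CPU-only sketch. The function ``reference_segmented_sort`` below is a hypothetical helper written for this illustration and is not part of Accelerate. It mirrors only the documented semantics (each segment of ``keys`` is sorted in place and ``vals`` is reordered the same way), under the assumptions that the sort is ascending and that plain Python lists stand in for the arrays. It does not reflect how the GPU kernel is implemented.

```python
def reference_segmented_sort(keys, vals, segments):
    """CPU reference (hypothetical helper): sort each segment of `keys`
    in place and apply the same reordering to `vals`."""
    # Separation locations [3, 6, 8] on 10 keys give the boundaries
    # [0, 3, 6, 8, 10], i.e. keys[:3], keys[3:6], keys[6:8], keys[8:].
    bounds = [0] + list(segments) + [len(keys)]
    for start, stop in zip(bounds, bounds[1:]):
        # Positions within this segment, ordered by their key
        # (ascending order is an assumption of this sketch).
        order = sorted(range(start, stop), key=lambda i: keys[i])
        # Build the reordered values before assigning, so both slices
        # are permuted consistently.
        new_keys = [keys[i] for i in order]
        new_vals = [vals[i] for i in order]
        keys[start:stop] = new_keys
        vals[start:stop] = new_vals


keys = [5, 1, 4, 9, 7, 8, 2, 0, 6, 3]
vals = list(range(len(keys)))  # original positions, carried along with the keys
reference_segmented_sort(keys, vals, [3, 6, 8])
print(keys)  # [1, 4, 5, 7, 8, 9, 0, 2, 3, 6]
print(vals)  # [1, 2, 0, 4, 5, 3, 7, 6, 9, 8]
```

Each segment is sorted independently, so small keys never migrate across a segment boundary. Passing the original positions as ``vals`` is one way to recover a per-segment argsort from the result.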