Writing CUDA-Python
=========================

.. raw:: html

	<p>The CUDA JIT is a low-level entry point to the CUDA features in NumbaPro.
	It translates Python functions into <a class="reference external" href="http://en.wikipedia.org/wiki/Parallel_Thread_Execution">PTX</a> code which executes on
	the CUDA hardware.  The <cite>jit</cite> decorator is applied to Python functions written
	in our <a class="reference external" href="CUDAPySpec.html">Python dialect for CUDA</a>.
	NumbaPro interacts with the <a class="reference external" href="http://docs.nvidia.com/cuda/cuda-driver-api/index.html">CUDA Driver API</a> to load the PTX onto
	the CUDA device and execute it.</p>
	<div class="section" id="imports">
	<h2>Imports<a class="headerlink" href="#imports" title="Permalink to this headline">¶</a></h2>
	<p>Most of the public CUDA API is exposed in the
	<code class="docutils literal"><span class="pre">numbapro.cuda</span></code> module:</p>
	<div class="highlight-python"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">numbapro</span> <span class="kn">import</span> <span class="n">cuda</span>
	</pre></div>
	</div>
	</div>
	<div class="section" id="compiling">
	<h2>Compiling<a class="headerlink" href="#compiling" title="Permalink to this headline">¶</a></h2>
	<p>CUDA kernels and device functions are compiled by decorating a Python
	function with the <cite>jit</cite> or <cite>autojit</cite> decorator.</p>
	</div>
	<div class="section" id="thread-identity-by-cuda-intrinsics">
	<h2>Thread Identity by CUDA Intrinsics<a class="headerlink" href="#thread-identity-by-cuda-intrinsics" title="Permalink to this headline">¶</a></h2>
	<p>A set of CUDA intrinsics is used to identify the current execution thread.
	These intrinsics are meaningful inside a CUDA kernel or device function only.
	A common pattern is to assign the computation of each element in the output
	array to a thread.</p>
	<p>For a 1D grid:</p>
	<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">tx</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">threadIdx</span><span class="o">.</span><span class="n">x</span>
	<span class="n">bx</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockIdx</span><span class="o">.</span><span class="n">x</span>
	<span class="n">bw</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockDim</span><span class="o">.</span><span class="n">x</span>
	<span class="n">i</span> <span class="o">=</span> <span class="n">tx</span> <span class="o">+</span> <span class="n">bx</span> <span class="o">*</span> <span class="n">bw</span>
	<span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">something</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
	</pre></div>
	</div>
	<p>For a 2D grid:</p>
	<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">tx</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">threadIdx</span><span class="o">.</span><span class="n">x</span>
	<span class="n">ty</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">threadIdx</span><span class="o">.</span><span class="n">y</span>
	<span class="n">bx</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockIdx</span><span class="o">.</span><span class="n">x</span>
	<span class="n">by</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockIdx</span><span class="o">.</span><span class="n">y</span>
	<span class="n">bw</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockDim</span><span class="o">.</span><span class="n">x</span>
	<span class="n">bh</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockDim</span><span class="o">.</span><span class="n">y</span>
	<span class="n">x</span> <span class="o">=</span> <span class="n">tx</span> <span class="o">+</span> <span class="n">bx</span> <span class="o">*</span> <span class="n">bw</span>
	<span class="n">y</span> <span class="o">=</span> <span class="n">ty</span> <span class="o">+</span> <span class="n">by</span> <span class="o">*</span> <span class="n">bh</span>
	<span class="n">array</span><span class="p">[</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">]</span> <span class="o">=</span> <span class="n">something</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
	</pre></div>
	</div>
	<p>Since these patterns are so common, there is a shorthand function to produce
	the same result.</p>
	<p>For a 1D grid:</p>
	<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">i</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
	<span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">something</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
	</pre></div>
	</div>
	<p>For a 2D grid:</p>
	<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
	<span class="n">array</span><span class="p">[</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">]</span> <span class="o">=</span> <span class="n">something</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
	</pre></div>
	</div>
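	<p>The index arithmetic these intrinsics perform can be checked with ordinary host-side Python.  The following is a sketch only, with no GPU required; the block and grid sizes are illustrative values, not part of the API:</p>

```python
# Host-side sketch of the global-index computation for a 1D grid.
# blockDim_x and gridDim_x are illustrative launch parameters.
blockDim_x = 4   # threads per block
gridDim_x = 3    # blocks in the grid

# Enumerate the global index i = tx + bx * bw for every thread.
indices = []
for blockIdx_x in range(gridDim_x):
    for threadIdx_x in range(blockDim_x):
        i = threadIdx_x + blockIdx_x * blockDim_x
        indices.append(i)

# Each element of a length gridDim_x * blockDim_x array is assigned
# to exactly one thread.
assert indices == list(range(gridDim_x * blockDim_x))
```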
	</div>
	<div class="section" id="memory-transfer">
	<h2>Memory Transfer<a class="headerlink" href="#memory-transfer" title="Permalink to this headline">¶</a></h2>
	<p>By default, any NumPy array used as an argument to a CUDA kernel is transferred
	automatically to and from the device.  However, to achieve maximum performance
	and minimize redundant memory transfers,
	users should manage the memory transfers explicitly.</p>
	<p>Host-&gt;device transfers are asynchronous to the host.
	Device-&gt;host transfers are synchronous to the host.
	If a non-zero <a class="reference internal" href="#cuda-stream">CUDA stream</a> is provided, the transfer becomes asynchronous.</p>
	<p>Explicit transfers are performed through <cite>DeviceNDArray</cite> factory functions such as <code class="docutils literal"><span class="pre">cuda.to_device()</span></code>, shown in the <a class="reference internal" href="#cuda-stream">CUDA stream</a> section below.</p>
	</div>
	<div class="section" id="memory-lifetime">
	<h2>Memory Lifetime<a class="headerlink" href="#memory-lifetime" title="Permalink to this headline">¶</a></h2>
	<p>The lifetime of a device array is bound to the lifetime of the
	<cite>DeviceNDArray</cite> instance.</p>
	</div>
	<div class="section" id="cuda-stream">
	<h2>CUDA Stream<a class="headerlink" href="#cuda-stream" title="Permalink to this headline">¶</a></h2>
	<p>A CUDA stream is a command queue for the CUDA device.  By specifying a stream,
	the CUDA API calls become asynchronous, meaning that the call may return before
	the command has been completed.  Memory transfer functions and kernel
	invocations can take a CUDA stream:</p>
	<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">stream</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">stream</span><span class="p">()</span>
	<span class="n">devary</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">to_device</span><span class="p">(</span><span class="n">an_array</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="n">stream</span><span class="p">)</span>
	<span class="n">a_cuda_kernel</span><span class="p">[</span><span class="n">griddim</span><span class="p">,</span> <span class="n">blockdim</span><span class="p">,</span> <span class="n">stream</span><span class="p">](</span><span class="n">devary</span><span class="p">)</span>
	<span class="n">devary</span><span class="o">.</span><span class="n">copy_to_host</span><span class="p">(</span><span class="n">an_array</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="n">stream</span><span class="p">)</span>
	<span class="c1"># data may not be available in an_array</span>
	<span class="n">stream</span><span class="o">.</span><span class="n">synchronize</span><span class="p">()</span>
	<span class="c1"># data available in an_array</span>
	</pre></div>
	</div>
	<p>An alternative syntax is available using a Python context manager:</p>
	<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">stream</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">stream</span><span class="p">()</span>
	<span class="k">with</span> <span class="n">stream</span><span class="o">.</span><span class="n">auto_synchronize</span><span class="p">():</span>
	    <span class="n">devary</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">to_device</span><span class="p">(</span><span class="n">an_array</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="n">stream</span><span class="p">)</span>
	    <span class="n">a_cuda_kernel</span><span class="p">[</span><span class="n">griddim</span><span class="p">,</span> <span class="n">blockdim</span><span class="p">,</span> <span class="n">stream</span><span class="p">](</span><span class="n">devary</span><span class="p">)</span>
	    <span class="n">devary</span><span class="o">.</span><span class="n">copy_to_host</span><span class="p">(</span><span class="n">an_array</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="n">stream</span><span class="p">)</span>
	<span class="c1"># data available in an_array</span>
	</pre></div>
	</div>
	<p>When the Python <code class="docutils literal"><span class="pre">with</span></code> context exits, the stream is automatically synchronized.</p>
	</div>
	<div class="section" id="shared-memory">
	<h2>Shared Memory<a class="headerlink" href="#shared-memory" title="Permalink to this headline">¶</a></h2>
	<p>For maximum performance, a CUDA kernel should use shared memory for manual caching of data.  CUDA JIT supports the use of <code class="docutils literal"><span class="pre">cuda.shared.array(shape,</span> <span class="pre">dtype)</span></code> for declaring a NumPy-array-like object inside a kernel.</p>
	<p>For example:</p>
	<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">bpg</span> <span class="o">=</span> <span class="mi">50</span>
	<span class="n">tpb</span> <span class="o">=</span> <span class="mi">32</span>
	<span class="n">n</span> <span class="o">=</span> <span class="n">bpg</span> <span class="o">*</span> <span class="n">tpb</span>

	<span class="nd">@jit</span><span class="p">(</span><span class="n">argtypes</span><span class="o">=</span><span class="p">[</span><span class="n">float32</span><span class="p">[:,:],</span> <span class="n">float32</span><span class="p">[:,:],</span> <span class="n">float32</span><span class="p">[:,:]],</span> <span class="n">target</span><span class="o">=</span><span class="s1">&#39;gpu&#39;</span><span class="p">)</span>
	<span class="k">def</span> <span class="nf">cu_square_matrix_mul</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">):</span>
	    <span class="n">sA</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">shared</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">tpb</span><span class="p">,</span> <span class="n">tpb</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float32</span><span class="p">)</span>
	    <span class="n">sB</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">shared</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">tpb</span><span class="p">,</span> <span class="n">tpb</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float32</span><span class="p">)</span>

	    <span class="n">tx</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">threadIdx</span><span class="o">.</span><span class="n">x</span>
	    <span class="n">ty</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">threadIdx</span><span class="o">.</span><span class="n">y</span>
	    <span class="n">bx</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockIdx</span><span class="o">.</span><span class="n">x</span>
	    <span class="n">by</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockIdx</span><span class="o">.</span><span class="n">y</span>
	    <span class="n">bw</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockDim</span><span class="o">.</span><span class="n">x</span>
	    <span class="n">bh</span> <span class="o">=</span> <span class="n">cuda</span><span class="o">.</span><span class="n">blockDim</span><span class="o">.</span><span class="n">y</span>

	    <span class="n">x</span> <span class="o">=</span> <span class="n">tx</span> <span class="o">+</span> <span class="n">bx</span> <span class="o">*</span> <span class="n">bw</span>
	    <span class="n">y</span> <span class="o">=</span> <span class="n">ty</span> <span class="o">+</span> <span class="n">by</span> <span class="o">*</span> <span class="n">bh</span>

	    <span class="n">acc</span> <span class="o">=</span> <span class="mf">0.</span>
	    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">bpg</span><span class="p">):</span>
	        <span class="k">if</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">:</span>
	            <span class="n">sA</span><span class="p">[</span><span class="n">ty</span><span class="p">,</span> <span class="n">tx</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">y</span><span class="p">,</span> <span class="n">tx</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">tpb</span><span class="p">]</span>
	            <span class="n">sB</span><span class="p">[</span><span class="n">ty</span><span class="p">,</span> <span class="n">tx</span><span class="p">]</span> <span class="o">=</span> <span class="n">B</span><span class="p">[</span><span class="n">ty</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">tpb</span><span class="p">,</span> <span class="n">x</span><span class="p">]</span>

	        <span class="n">cuda</span><span class="o">.</span><span class="n">syncthreads</span><span class="p">()</span>

	        <span class="k">if</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">:</span>
	            <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">tpb</span><span class="p">):</span>
	                <span class="n">acc</span> <span class="o">+=</span> <span class="n">sA</span><span class="p">[</span><span class="n">ty</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">sB</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">tx</span><span class="p">]</span>

	        <span class="n">cuda</span><span class="o">.</span><span class="n">syncthreads</span><span class="p">()</span>

	    <span class="k">if</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span> <span class="ow">and</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">:</span>
	        <span class="n">C</span><span class="p">[</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">acc</span>
	</pre></div>
	</div>
	<p>The equivalent code in CUDA-C would be:</p>
	<div class="highlight-c"><div class="highlight"><pre><span></span><span class="cp">#define pos2d(Y, X, W) ((Y) * (W) + (X))</span>

	<span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">BPG</span> <span class="o">=</span> <span class="mi">50</span><span class="p">;</span>
	<span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">TPB</span> <span class="o">=</span> <span class="mi">32</span><span class="p">;</span>
	<span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">N</span> <span class="o">=</span> <span class="n">BPG</span> <span class="o">*</span> <span class="n">TPB</span><span class="p">;</span>

	<span class="n">__global__</span>
	<span class="kt">void</span> <span class="nf">cuMatrixMul</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="n">A</span><span class="p">[],</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">B</span><span class="p">[],</span> <span class="kt">float</span> <span class="n">C</span><span class="p">[]){</span>
	    <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">sA</span><span class="p">[</span><span class="n">TPB</span> <span class="o">*</span> <span class="n">TPB</span><span class="p">];</span>
	    <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">sB</span><span class="p">[</span><span class="n">TPB</span> <span class="o">*</span> <span class="n">TPB</span><span class="p">];</span>

	    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">tx</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
	    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">ty</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
	    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">bx</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
	    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">by</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
	    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">bw</span> <span class="o">=</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
	    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">bh</span> <span class="o">=</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>

	    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="n">tx</span> <span class="o">+</span> <span class="n">bx</span> <span class="o">*</span> <span class="n">bw</span><span class="p">;</span>
	    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="n">ty</span> <span class="o">+</span> <span class="n">by</span> <span class="o">*</span> <span class="n">bh</span><span class="p">;</span>

	    <span class="kt">float</span> <span class="n">acc</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">;</span>

	    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">BPG</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
	        <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">N</span> <span class="o">&amp;&amp;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
	            <span class="n">sA</span><span class="p">[</span><span class="n">pos2d</span><span class="p">(</span><span class="n">ty</span><span class="p">,</span> <span class="n">tx</span><span class="p">,</span> <span class="n">TPB</span><span class="p">)]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">pos2d</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">tx</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">TPB</span><span class="p">,</span> <span class="n">N</span><span class="p">)];</span>
	            <span class="n">sB</span><span class="p">[</span><span class="n">pos2d</span><span class="p">(</span><span class="n">ty</span><span class="p">,</span> <span class="n">tx</span><span class="p">,</span> <span class="n">TPB</span><span class="p">)]</span> <span class="o">=</span> <span class="n">B</span><span class="p">[</span><span class="n">pos2d</span><span class="p">(</span><span class="n">ty</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">TPB</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">N</span><span class="p">)];</span>
	        <span class="p">}</span>
	        <span class="n">__syncthreads</span><span class="p">();</span>
	        <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">N</span> <span class="o">&amp;&amp;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
	            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">TPB</span><span class="p">;</span> <span class="o">++</span><span class="n">j</span><span class="p">)</span> <span class="p">{</span>
	                <span class="n">acc</span> <span class="o">+=</span> <span class="n">sA</span><span class="p">[</span><span class="n">pos2d</span><span class="p">(</span><span class="n">ty</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">TPB</span><span class="p">)]</span> <span class="o">*</span> <span class="n">sB</span><span class="p">[</span><span class="n">pos2d</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="n">tx</span><span class="p">,</span> <span class="n">TPB</span><span class="p">)];</span>
	            <span class="p">}</span>
	        <span class="p">}</span>
	        <span class="n">__syncthreads</span><span class="p">();</span>
	    <span class="p">}</span>

	    <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">N</span> <span class="o">&amp;&amp;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
	        <span class="n">C</span><span class="p">[</span><span class="n">pos2d</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">N</span><span class="p">)]</span> <span class="o">=</span> <span class="n">acc</span><span class="p">;</span>
	    <span class="p">}</span>
	<span class="p">}</span>
	</pre></div>
	</div>
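	<p>The tiling scheme used in both versions can be verified on the host with plain NumPy.  The following is a simulation of the kernel's loop over <code class="docutils literal"><span class="pre">bpg</span></code> tiles, not GPU code; the sizes are reduced for brevity:</p>

```python
import numpy as np

# Host-side NumPy simulation of the tiled multiply above: accumulate
# tile-by-tile partial products exactly as the kernel's loop over
# bpg does.  tpb is the tile width; sizes are reduced for brevity.
bpg, tpb = 4, 8
n = bpg * tpb

rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)).astype(np.float32)
B = rng.standard_normal((n, n)).astype(np.float32)

C = np.zeros((n, n), dtype=np.float32)
for i in range(bpg):
    # sA stages a tile of columns of A, sB a tile of rows of B,
    # mirroring the shared-memory staging in the kernel.
    sA = A[:, i * tpb:(i + 1) * tpb]
    sB = B[i * tpb:(i + 1) * tpb, :]
    C += sA @ sB

# The sum over tiles reproduces the full matrix product.
assert np.allclose(C, A @ B, atol=1e-4)
```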
	<p>The return value of <code class="docutils literal"><span class="pre">cuda.shared.array</span></code> is a NumPy-array-like object.  The <code class="docutils literal"><span class="pre">shape</span></code> argument is similar to that of the NumPy API, with the requirement that it must be a constant expression.  The <cite>dtype</cite> argument takes Numba types.</p>
	</div>
	<div class="section" id="synchronization-primitives">
	<h2>Synchronization Primitives<a class="headerlink" href="#synchronization-primitives" title="Permalink to this headline">¶</a></h2>
	<p>Only <code class="docutils literal"><span class="pre">cuda.syncthreads()</span></code> is currently supported.  It is equivalent to <code class="docutils literal"><span class="pre">__syncthreads()</span></code> in CUDA-C.</p>
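	<p>The guarantee a barrier provides can be illustrated with a host-side sketch.  The phased tree reduction below mirrors how <code class="docutils literal"><span class="pre">cuda.syncthreads()</span></code> separates the writes of one step from the reads of the next; this is plain Python, not GPU code:</p>

```python
# Host-side sketch of barrier semantics: a tree reduction proceeds in
# phases, and every simulated "thread" finishes writing in one phase
# before any reads happen in the next -- the role cuda.syncthreads()
# plays inside a thread block.
data = list(range(16))          # one value per simulated thread
stride = len(data) // 2
while stride > 0:
    # Compute the whole phase before writing back: this models the
    # barrier between one phase's writes and the next phase's reads.
    updates = [data[t] + data[t + stride] for t in range(stride)]
    data[:stride] = updates
    stride //= 2

assert data[0] == sum(range(16))   # the reduction yields the total
```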
