CUDA Ufuncs and Generalized Ufuncs
==================================

This page describes the CUDA ufunc-like object.

CUDA programs follow a different programming pattern from CPU code, so
the CUDA targets of ``vectorize`` and ``guvectorize`` cannot produce a
conventional ufunc.  Instead, they return a ufunc-like object.  This
object is a close analog of a regular NumPy ufunc but is not fully
compatible with one.  The CUDA ufunc also accepts arrays that already
reside on the GPU device, which reduces traffic over the PCI-express
bus, and a ``stream`` keyword for launching in asynchronous mode.

Example: Basic Usage
--------------------

.. testcode::

    import math
    from numbapro import vectorize, cuda
    import numpy as np

    @vectorize(['float32(float32, float32, float32)',
                'float64(float64, float64, float64)'],
               target='gpu')
    def cu_discriminant(a, b, c):
        return math.sqrt(b ** 2 - 4 * a * c)

    N = 10000  # array sizes must be integers, not floats
    dtype = np.float32

    # prepare the input
    A = np.array(np.random.sample(N), dtype=dtype)
    B = np.array(np.random.sample(N) + 10, dtype=dtype)
    C = np.array(np.random.sample(N), dtype=dtype)

    D = cu_discriminant(A, B, C)

    print(D)  # print result

.. testoutput::
    :hide:

    ...
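
As a cross-check that needs no GPU, the same formula can be evaluated with
plain NumPy.  The sketch below is an illustration, not part of the NumbaPro
API; the helper name ``discriminant_reference`` and the fixed random seed are
assumptions made for this example::

    import numpy as np

    def discriminant_reference(a, b, c):
        """Hypothetical CPU reference: elementwise sqrt(b**2 - 4*a*c)."""
        return np.sqrt(b ** 2 - 4 * a * c)

    N = 10000
    dtype = np.float32

    # same input ranges as the GPU example: A, C in [0, 1), B in [10, 11)
    rng = np.random.RandomState(0)  # seed chosen only for reproducibility
    A = rng.random_sample(N).astype(dtype)
    B = (rng.random_sample(N) + 10).astype(dtype)
    C = rng.random_sample(N).astype(dtype)

    expected = discriminant_reference(A, B, C)

    # If the CUDA ufunc is run on these same arrays, its result should
    # agree with ``expected`` up to float32 rounding, for example via
    # np.allclose(cu_discriminant(A, B, C), expected, rtol=1e-5).

Because ``B`` is at least 10 while ``A`` and ``C`` stay below 1, the argument
of the square root is always positive, so neither version produces NaNs.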


Example: Calling Device Functions
----------------------------------

All CUDA ufunc kernels can call other CUDA device functions::

    from numbapro import vectorize, cuda

    # define a device function
    @cuda.jit('float32(float32, float32, float32)', device=True, inline=True)
    def cu_device_fn(x, y, z):
        return x ** y / z

    # define a ufunc that calls our device function
    @vectorize(['float32(float32, float32, float32)'], target='gpu')
    def cu_ufunc(x, y, z):
        return cu_device_fn(x, y, z)


Generalized CUDA ufuncs
-----------------------

Generalized ufuncs may be executed on the GPU using CUDA, analogous to
the CUDA ufunc functionality.  This may be accomplished as follows::

    from numbapro import guvectorize

    @guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                 '(m,n),(n,p)->(m,p)', target='gpu')
    def matmulcore(A, B, C):
        ...
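
To make the layout string ``(m,n),(n,p)->(m,p)`` concrete, the following
CPU-only sketch mimics how a core function with that layout is applied to
each matrix pair along a leading batch axis.  It is not the elided kernel
body above and does not use NumbaPro; the names ``matmul_core_reference``
and ``apply_over_batch`` are hypothetical helpers for this illustration::

    import numpy as np

    def matmul_core_reference(A, B, C):
        """Hypothetical core: write the (m, p) product of A and B into C."""
        m, n = A.shape
        p = B.shape[1]
        for i in range(m):
            for j in range(p):
                acc = 0.0
                for k in range(n):
                    acc += A[i, k] * B[k, j]
                C[i, j] = acc  # output goes to the last argument

    def apply_over_batch(core, A, B):
        """Call ``core`` once per matrix pair along the leading axis."""
        batch, m, _ = A.shape
        p = B.shape[-1]
        out = np.empty((batch, m, p), dtype=A.dtype)
        for idx in range(batch):
            # out[idx] is a view, so the core fills ``out`` in place
            core(A[idx], B[idx], out[idx])
        return out

    A = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
    B = np.ones((2, 4, 5), dtype=np.float32)
    C = apply_over_batch(matmul_core_reference, A, B)

The core function only ever sees 2-D arrays; the loop over the leading axis
is what the gufunc machinery supplies, and on the GPU those iterations run in
parallel rather than sequentially.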

A gufunc kernel can sometimes use too many of the GPU's resources,
which can cause the kernel launch to fail.  To work around this, the
user can explicitly cap the size of the thread block by setting the
``max_blocksize`` attribute on the compiled gufunc object.

::

    from numbapro import guvectorize

    @guvectorize(..., target='gpu')
    def very_complex_kernel(A, B, C):
        ...

    very_complex_kernel.max_blocksize = 32  # limits to 32 threads per block

.. comment

    Example: A Chunk at a Time
    ---------------------------

    Partitioning your data into chunks allows computation and memory transfer
    to be overlapped.  This can increase the throughput of your ufunc and
    enables your ufunc to operate on data that is larger than the memory
    capacity of your GPU.  For example:

    .. testcode::

        import math
        from numbapro import vectorize, cuda
        import numpy as np

        # the ufunc kernel
        def discriminant(a, b, c):
            return math.sqrt(b ** 2 - 4 * a * c)

        cu_discriminant = vectorize(['float32(float32, float32, float32)',
                                     'float64(float64, float64, float64)'],
                                    target='gpu')(discriminant)

        N = int(1e+8)  # array sizes must be integers
        dtype = np.float32

        # prepare the input
        A = np.array(np.random.sample(N), dtype=dtype)
        B = np.array(np.random.sample(N) + 10, dtype=dtype)
        C = np.array(np.random.sample(N), dtype=dtype)
        D = np.empty(A.shape, dtype=A.dtype)

        # create a CUDA stream
        stream = cuda.stream()

        chunksize = int(1e+6)
        chunkcount = int(N // chunksize)  # np.split needs an integer count

        # partition numpy arrays into chunks
        # no copying is performed
        sA = np.split(A, chunkcount)
        sB = np.split(B, chunkcount)
        sC = np.split(C, chunkcount)
        sD = np.split(D, chunkcount)

        device_ptrs = []

        with stream.auto_synchronize():
            # every operation in this context will be launched asynchronously
            # on the CUDA stream

            # for each chunk
            for a, b, c, d in zip(sA, sB, sC, sD):
                # transfer to device
                dA = cuda.to_device(a, stream)
                dB = cuda.to_device(b, stream)
                dC = cuda.to_device(c, stream)
                dD = cuda.to_device(d, stream, copy=False) # no copying
                # launch kernel
                cu_discriminant(dA, dB, dC, out=dD, stream=stream)
                # retrieve result
                dD.copy_to_host(d, stream)
                # store device pointers to prevent them from freeing before
                # the kernel is scheduled
                device_ptrs.extend([dA, dB, dC, dD])

        # data is ready at this point inside D