= CuPy =

CuPy
- Logo: CuPy Logo.png
- Author: Seiya Tokui
- Developer: Community, Preferred Networks, Inc.
- Released: .
- Latest Release Version: v13.6.0
- Programming Language: Python, Cython, CUDA
- Repo: http://github.com/cupy/cupy
- Operating System: Linux, Windows
- Platform: Cross-platform
- Genre: Numerical analysis
- License: MIT

CuPy is an open-source library for GPU-accelerated computing with the Python programming language, providing support for multi-dimensional arrays, sparse matrices, and a variety of numerical algorithms implemented on top of them.
CuPy shares the same API set as NumPy and SciPy, allowing it to serve as a drop-in replacement for running NumPy/SciPy code on a GPU. CuPy supports the Nvidia CUDA GPU platform and, starting in v9.0, the AMD ROCm GPU platform.

CuPy was initially developed as a backend of the Chainer deep learning framework and was later established as an independent project in 2017.

CuPy is one of the array libraries in the NumPy ecosystem and is widely adopted for using GPUs with Python, especially in high-performance computing environments such as Summit, Perlmutter, EULER, and ABCI.

CuPy is a NumFOCUS sponsored project.

== Features ==

CuPy implements NumPy/SciPy-compatible APIs, as well as features to write user-defined GPU kernels or access low-level APIs.

=== NumPy-compatible APIs ===

The same set of APIs defined in the NumPy package (<code>numpy.*</code>) is available under the <code>cupy.*</code> package.

- Multi-dimensional array (<code>cupy.ndarray</code>) for boolean, integer, float, and complex data types
- Module-level functions
- Linear algebra functions
- Fast Fourier transform
- Random number generator
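
As a rough illustration of the NumPy-compatible API, the sketch below writes a function once against a generic array module and runs it with CuPy when a GPU is usable. The helper name <code>normalize_rows</code>, the <code>xp</code> alias, and the fallback to NumPy are assumptions made for this example, not part of CuPy itself.

```python
import numpy as np

try:
    import cupy as cp
    # Raises if no usable CUDA/ROCm device or driver is present.
    use_gpu = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    use_gpu = False

# Array module used below: CuPy on a GPU machine, NumPy otherwise.
xp = cp if use_gpu else np


def normalize_rows(a):
    """Scale each row to unit Euclidean norm using only NumPy-style calls."""
    norms = xp.linalg.norm(a, axis=1, keepdims=True)
    return a / norms


a = xp.array([[3.0, 4.0], [6.0, 8.0]])
unit = normalize_rows(a)

# Copy the result to host memory when it lives on the GPU.
unit_host = cp.asnumpy(unit) if use_gpu else unit
print(unit_host)
```

The same function body runs unchanged with either module, which is the sense in which CuPy can act as a drop-in replacement.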

=== SciPy-compatible APIs ===

The same set of APIs defined in the SciPy package (<code>scipy.*</code>) is available under the <code>cupyx.scipy.*</code> package.

- Sparse matrices (<code>cupyx.scipy.sparse.*_matrix</code>) in CSR, COO, CSC, and DIA formats
- Discrete Fourier transform
- Advanced linear algebra
- Multidimensional image processing
- Sparse linear algebra
- Special functions
- Signal processing
- Statistical functions

=== User-defined GPU kernels ===

- Kernel templates for element-wise and reduction operations
- Raw kernel (CUDA C/C++)
- Just-in-time transpiler (JIT)
- Kernel fusion
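
A kernel template for element-wise operations might be used roughly as follows. This is a minimal sketch: the kernel name <code>squared_diff</code> and the input values are chosen for illustration, and the device check is an assumption added so the snippet also runs (without launching a kernel) on machines that have no GPU.

```python
import numpy as np

try:
    import cupy as cp
    # Raises if no usable CUDA/ROCm device or driver is present.
    use_gpu = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    use_gpu = False

x_host = np.arange(6, dtype=np.float32)
y_host = np.full(6, 2, dtype=np.float32)
expected = (x_host - y_host) ** 2      # CPU reference result

result = None
if use_gpu:
    squared_diff = cp.ElementwiseKernel(
        'float32 x, float32 y',        # input parameters
        'float32 z',                   # output parameter
        'z = (x - y) * (x - y)',       # body applied to every element
        'squared_diff')                # kernel name
    # The output array is allocated automatically when it is not passed in.
    z = squared_diff(cp.asarray(x_host), cp.asarray(y_host))
    result = cp.asnumpy(z)             # device -> host copy
    print(result)
```

The template handles indexing and broadcasting, so only the per-element expression has to be written in CUDA C/C++.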

=== Distributed computing ===

- Distributed communication package (<code>cupyx.distributed</code>), providing collective and peer-to-peer primitives

=== Low-level CUDA features ===

- Stream and event
- Memory pool
- Profiler
- Host API binding
- CUDA Python support
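
The stream, event, and memory pool features listed above can be combined roughly as in the following sketch. The array size and variable names are arbitrary choices for illustration, and the device check is an assumption added so the snippet does nothing on machines without a GPU.

```python
try:
    import cupy as cp
    # Raises if no usable CUDA/ROCm device or driver is present.
    use_gpu = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    use_gpu = False

elapsed_ms = None
if use_gpu:
    pool = cp.get_default_memory_pool()
    stream = cp.cuda.Stream(non_blocking=True)
    start, stop = cp.cuda.Event(), cp.cuda.Event()

    with stream:                       # work below is enqueued on `stream`
        start.record(stream)
        a = cp.ones((1024, 1024), dtype=cp.float32)
        total = a.sum()
        stop.record(stream)
    stop.synchronize()                 # wait until the stream reaches `stop`

    # Time between the two events, in milliseconds.
    elapsed_ms = cp.cuda.get_elapsed_time(start, stop)
    print(f"sum={float(total)}, elapsed={elapsed_ms:.3f} ms, "
          f"pool bytes in use={pool.used_bytes()}")

    del a, total
    pool.free_all_blocks()             # release cached blocks held by the pool
```

Timing with events measures work on the device itself, which a host-side timer would miss because kernel launches are asynchronous.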

=== Interoperability ===

- DLPack
- CUDA Array Interface
- NEP 13 (<code>__array_ufunc__</code>)
- NEP 18 (<code>__array_function__</code>)
- Array API Standard
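
The following sketch shows how some of these interoperability mechanisms might look in practice: NumPy functions dispatching to CuPy through NEP 13 and NEP 18, and a zero-copy exchange through DLPack. The variable names are illustrative, and the device check is an assumption added so the snippet is skipped on machines without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    # Raises if no usable CUDA/ROCm device or driver is present.
    use_gpu = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    use_gpu = False

host = np.arange(4, dtype=np.float32)

roundtrip = None
if use_gpu:
    dev = cp.asarray(host)             # host -> device copy

    # NEP 13: a NumPy ufunc called on a CuPy array is dispatched to CuPy.
    rooted = np.sqrt(dev)
    # NEP 18: a NumPy function called on a CuPy array is dispatched to CuPy.
    total = np.sum(dev)
    print(type(rooted).__name__, type(total).__name__)

    # DLPack: re-import the same device buffer without copying it.
    shared = cp.from_dlpack(dev)
    shared[0] = 10                     # the write is visible through `dev`

    roundtrip = cp.asnumpy(dev)        # device -> host copy
    print(roundtrip)
```

Because both protocols keep the data on the device, code written against NumPy functions can operate on GPU arrays without intermediate host copies.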

== Examples ==

=== Array creation ===
<syntaxhighlight lang="numpy">
>>> import cupy as cp
>>> x = cp.array([1, 2, 3])
>>> x
array([1, 2, 3])
>>> y = cp.arange(10)
>>> y
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
</syntaxhighlight>

=== Basic operations ===
<syntaxhighlight lang="numpy">
>>> import cupy as cp
>>> x = cp.arange(12).reshape(3, 4).astype(cp.float32)
>>> x
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]], dtype=float32)
>>> x.sum(axis=1)
array([ 6., 22., 38.], dtype=float32)
</syntaxhighlight>

=== Raw CUDA C/C++ kernel ===
<syntaxhighlight lang="numpy">
>>> import cupy as cp
>>> kern = cp.RawKernel(r'''
... extern "C" __global__
... void multiply_elemwise(const float* in1, const float* in2, float* out) {
...     int tid = blockDim.x * blockIdx.x + threadIdx.x;
...     out[tid] = in1[tid] * in2[tid];
... }
... ''', 'multiply_elemwise')
>>> in1 = cp.arange(16, dtype=cp.float32).reshape(4, 4)
>>> in2 = cp.arange(16, dtype=cp.float32).reshape(4, 4)
>>> out = cp.zeros((4, 4), dtype=cp.float32)
>>> kern((4,), (4,), (in1, in2, out)) # grid, block and arguments
>>> out
array([[  0.,   1.,   4.,   9.],
       [ 16.,  25.,  36.,  49.],
       [ 64.,  81., 100., 121.],
       [144., 169., 196., 225.]], dtype=float32)
</syntaxhighlight>

== Applications ==

- spaCy
- XGBoost
- (Berkeley SETI)
- NVIDIA RAPIDS
- scikit-learn
- MONAI
- Chainer

== See also ==

- Array programming
- List of numerical-analysis software
- Dask
