Following this site, I will work through the basics of CuPy and then programming that combines CuPy with Numba.

CuPy = NumPy + GPU

import numpy as np
import cupy as cp

CuPy arrays look just like NumPy arrays:

ary = cp.arange(10).reshape((2,5))
print(repr(ary))
print(ary.dtype)
print(ary.shape)
print(ary.strides)

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
int64
(2, 5)
(40, 8)
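
The strides value in the output above can be reproduced by hand: each stride is the number of bytes to skip to move one step along that axis. The sketch below uses NumPy only, on the assumption that CuPy reports strides the same way for this array (the output above suggests it does). The names row_stride and col_stride are illustrative.

```python
import numpy as np

# Same shape and dtype as the CuPy array above, built on the CPU.
ary = np.arange(10, dtype=np.int64).reshape((2, 5))

itemsize = ary.itemsize                 # 8 bytes per int64 element
row_stride = ary.shape[1] * itemsize    # skip 5 elements to reach the next row
col_stride = itemsize                   # skip 1 element to reach the next column

print(ary.strides)                      # (40, 8)
print((row_stride, col_stride))         # (40, 8)
```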

This array lives in the memory of the default GPU (device 0). We can confirm this by inspecting the array's device attribute:

ary.device

<CUDA Device 0>

We can move data from the CPU to the GPU using the cp.asarray() function:

ary_cpu = np.arange(10)
ary_gpu = cp.asarray(ary_cpu)
print('cpu:', ary_cpu)
print('gpu:', ary_gpu)
print(ary_gpu.device)

cpu: [0 1 2 3 4 5 6 7 8 9]
gpu: [0 1 2 3 4 5 6 7 8 9]
<CUDA Device 0>

Note that when we print the contents of a GPU array, CuPy copies the data from the GPU back to the CPU so that it can print the results.

When we are done with the data on the GPU, we can convert it back to a NumPy array on the CPU with the cp.asnumpy() function:

ary_cpu_returned = cp.asnumpy(ary_gpu)
print(repr(ary_cpu_returned))
print(type(ary_cpu_returned))

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
<class 'numpy.ndarray'>

GPU Array Math

Most NumPy functions are supported in CuPy with identical names and arguments:

print(ary_gpu * 2)
print(cp.exp(-0.5 * ary_gpu**2))
print(cp.linalg.norm(ary_gpu))
print(cp.random.normal(loc=5, scale=2.0, size=10))

[ 0  2  4  6  8 10 12 14 16 18]
[1.00000000e+00 6.06530660e-01 1.35335283e-01 1.11089965e-02
 3.35462628e-04 3.72665317e-06 1.52299797e-08 2.28973485e-11
 1.26641655e-14 2.57675711e-18]
16.881943016134134
[3.26791987 5.06929768 3.41834602 2.58856973 5.99100382 3.67596393
 3.74813812 2.2890481  6.11608167 4.37072421]

You may notice a slight pause the first time you run these functions. This is because CuPy compiles the CUDA functions on the fly and then caches them to disk for future reuse.
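
Because the function names and arguments match, a computation can be written once against a module parameter and run with either library. The sketch below is illustrative only: gaussian_and_norm and its xp parameter are hypothetical names, and the call shown uses NumPy so that it runs without a GPU.

```python
import numpy as np

def gaussian_and_norm(xp, ary):
    """Apply the same operations as above using module xp (np or cp)."""
    gauss = xp.exp(-0.5 * ary**2)      # element-wise, same call in both libraries
    norm = xp.linalg.norm(ary)         # Euclidean norm of the input
    return gauss, norm

ary_cpu = np.arange(10)
gauss, norm = gaussian_and_norm(np, ary_cpu)
print(gauss[:3])
print(norm)
```

With CuPy installed and a GPU available, calling gaussian_and_norm(cp, cp.arange(10)) should run the same lines on the device.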

That's pretty much it! CuPy is very easy to use and has excellent documentation, which is worth becoming familiar with.

Before we get into GPU performance measurement, let's switch gears back to Numba.

Managing GPU Memory

First, we build the add_ufunc function.

from numba import vectorize
import numpy as np

@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y

n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x

%timeit add_ufunc(x, y)  # Baseline performance with host arrays

1.57 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

There are two ways to create GPU arrays to pass to Numba. Numba defines its own GPU array object, which is not as fully featured as CuPy's but may be useful if your application does not need the rest of CuPy. The numba.cuda module includes a function that copies host (CPU) data to the GPU and returns a CUDA device array:

from numba import cuda

x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

print(x_device)
print(x_device.shape)
print(x_device.dtype)

<numba.cuda.cudadrv.devicearray.DeviceNDArray object at 0x7f15e4bd56a0>
(100000,)
float32

Device arrays can be passed to Numba's compiled CUDA functions just like NumPy arrays, but without the copy overhead:

%timeit add_ufunc(x_device, y_device)

771 µs ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That is already a big improvement (roughly twice as fast as the host-array baseline), but we are still allocating a device array for the output of the ufunc and copying it back to the host. We can create the output buffer ahead of time with the numba.cuda.device_array() function:

out_device = cuda.device_array(shape=(n,), dtype=np.float32)  # does not initialize the contents, like np.empty()

We can then pass the ufunc's out keyword argument to specify the output buffer:

%timeit add_ufunc(x_device, y_device, out=out_device)

674 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now that we have removed the device allocation and copy steps, the computation runs somewhat faster than before (674 µs versus 771 µs per loop). When we want to bring the device array back to host memory, we can use the copy_to_host() method:

out_host = out_device.copy_to_host()
print(out_host[:10])

[ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27.]
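
The out keyword is not specific to Numba: NumPy's built-in ufuncs accept it as well, so the preallocate-and-reuse pattern can be tried on the CPU first. The sketch below uses np.add as a stand-in for add_ufunc (it is not the GPU code above) and a smaller n for readability.

```python
import numpy as np

n = 10
x = np.arange(n).astype(np.float32)
y = 2 * x

# Preallocate the output once; like cuda.device_array(), np.empty()
# leaves the contents uninitialized.
out = np.empty(n, dtype=np.float32)

# The ufunc writes into the existing buffer and returns that same array,
# so no new output array is allocated on each call.
result = np.add(x, y, out=out)

print(result is out)   # True
print(out)
```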

CuPy Interoperability

Recent versions of CuPy (>= 4.5) support Numba's generic CUDA array interface. We can see this on a CuPy array by looking for the __cuda_array_interface__ attribute:

x_cp = cp.asarray(x)
y_cp = cp.asarray(y)
out_cp = cp.empty_like(y_cp)

x_cp.__cuda_array_interface__

{'shape': (100000,),
 'typestr': '<f4',
 'descr': [('', '<f4')],
 'data': (139731908165632, False),
 'version': 0}

This describes the CuPy array in a portable way so that other packages, like Numba, can use it:

add_ufunc(x_cp, y_cp, out=out_cp)

print(out_cp[:10])

[ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27.]

Passing CuPy arrays runs at essentially the same speed as using Numba's device arrays:

%timeit add_ufunc(x_cp, y_cp, out=out_cp)

664 µs ± 3.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that Numba won't automatically create a CuPy array for the ufunc output, so if you want the ufunc result to be stored in a CuPy array, be sure to pass an explicit out argument to the ufunc, as shown above.
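
The __cuda_array_interface__ dictionary shown earlier has a CPU counterpart: NumPy arrays expose __array_interface__ with similar keys. The sketch below inspects it as a GPU-free way to see what such a description contains. It is an analogy only; the exact keys and version number differ from the CUDA interface.

```python
import numpy as np

x = np.arange(100000).astype(np.float32)
iface = x.__array_interface__

print(iface["shape"])                # (100000,)
print(iface["typestr"])              # <f4 on little-endian machines
pointer, read_only = iface["data"]   # memory address and read-only flag
print(read_only)                     # False
```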