Today I'll try out PyCUDA's InclusiveScanKernel, using the code from this site as a reference.

Load the code

First, load the source code to be used from the site mentioned above.

cd git/gpu/Code/Python/PyCUDA

/home/workspace/1/git/gpu/Code/Python/PyCUDA

ls

DemoMetaCodepy.py              demo.py               hello_gpu.py
DumpProperties.py              demohandler.py        pycurand.py
MatmulSimple.py                demoshort.py          reduction.py
MeasureGpuarraySpeedRandom.py  functiontemplates.py  scan.py

# %load scan.py
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.scan import InclusiveScanKernel

knl = InclusiveScanKernel(np.int32, "a+b")

n = 2**20-2**18+5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = gpuarray.to_gpu(host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
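
To make clear what the assertion above checks: an inclusive scan with "a+b" is a running total that includes the current element, which is what np.cumsum returns. An exclusive scan instead starts from a neutral element and leaves the current element out. The sketch below shows both on the CPU. The helper names are illustrative and not part of PyCUDA, and the GPU variant assumes that pycuda.scan.ExclusiveScanKernel takes the neutral element as its third argument and requires a CUDA device.

```python
from itertools import accumulate


def inclusive_scan(values):
    """Running total that includes the current element (like np.cumsum)."""
    return list(accumulate(values))


def exclusive_scan(values, neutral=0):
    """Running total of everything before the current element."""
    out = []
    total = neutral
    for v in values:
        out.append(total)  # store the sum of the preceding elements
        total += v
    return out


def exclusive_scan_gpu(host_values):
    """Exclusive scan on the GPU (assumes PyCUDA and a CUDA device)."""
    # Imported lazily so the CPU helpers above work without a GPU.
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (sets up a CUDA context)
    import pycuda.gpuarray as gpuarray
    from pycuda.scan import ExclusiveScanKernel

    knl = ExclusiveScanKernel(np.int32, "a+b", "0")  # "0" is the neutral element
    dev = gpuarray.to_gpu(np.asarray(host_values, dtype=np.int32))
    knl(dev)  # the scan is applied in place
    return dev.get()


print(inclusive_scan([3, 1, 4]))  # [3, 4, 8]
print(exclusive_scan([3, 1, 4]))  # [0, 3, 4]
```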

Rewrite the code

Next, rewrite the code, using this site as a reference.

# %load scan.py
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import time
from pycuda.scan import InclusiveScanKernel

N = pow(2,15)
NArrays = 5

n = 2**20-2**18+5

arrayDList = []
for i in range(NArrays):
    host_data = np.random.randint(i, 10, n).astype(np.int32)
    arrayDList.append(gpuarray.to_gpu(host_data))

krn = InclusiveScanKernel(np.int32,"a+b")

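for i in range(NArrays):
    time1 = time.time()
    # The scan runs in place: a refers to the same GPU array as arrayDList[i].
    # Kernel launches are asynchronous, so this interval may not include the
    # full GPU execution time.
    a = krn(arrayDList[i])
    time2 = time.time()
    print ("time = " + str(time2-time1))
    print(a.get())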

time = 0.0006561279296875
[      7      14      15 ... 3538912 3538913 3538913]
time = 0.0005059242248535156
[      1       9      13 ... 3935508 3935511 3935519]
time = 0.0004506111145019531
[      2       8      11 ... 4324395 4324400 4324402]
time = 0.000591278076171875
[      5       9      17 ... 4717889 4717894 4717899]
time = 0.00037217140197753906
[      6      10      17 ... 5111789 5111795 5111800]
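
The final entries of the printed arrays can be sanity-checked with a little arithmetic. np.random.randint(i, 10, n) draws uniformly from the integers i through 9, whose mean is (i + 9) / 2, so the last element of each scan should land near n * (i + 9) / 2. A minimal sketch (the helper name expected_total is only for illustration):

```python
n = 2**20 - 2**18 + 5  # same array length as in the code above


def expected_total(low, high=10, length=n):
    """Expected sum of `length` uniform integers drawn from [low, high)."""
    mean = (low + high - 1) / 2  # mean of low, low+1, ..., high-1
    return length * mean


for i in range(5):
    print(i, expected_total(i))
```

For i = 0 this gives roughly 3.54 million, in line with the final value 3538913 printed above.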

Use ElementwiseKernel

from pycuda.elementwise import ElementwiseKernel

NArrays = 5
n = 2**20-2**18+5

arrayDList = []
for i in range(NArrays):
    host_data = np.random.randint(i, 10, n).astype(np.int32)
    arrayDList.append(gpuarray.to_gpu(host_data))

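# NOTE: ElementwiseKernel expects C-style argument declarations
# (e.g. "int *x, int *y, int *z") and a statement such as
# "z[i] = x[i] + y[i]". The kernel knr defined here is never called below.
knr = ElementwiseKernel(arguments = "np.int32",
                               operation = "a+b")

for i in range(NArrays):
    time1 = time.time()
    a = krn(arrayDList[i])  # calls krn (the InclusiveScanKernel), not knr
    time2 = time.time()
    print ("time = " + str(time2-time1))
    print(a)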

time = 0.0006115436553955078
[      8      13      22 ... 3540841 3540849 3540853]
time = 0.0003833770751953125
[      6      14      16 ... 3932299 3932307 3932313]
time = 0.0003941059112548828
[      4       8      10 ... 4325732 4325734 4325742]
time = 0.0003783702850341797
[      6      11      17 ... 4723416 4723419 4723427]
time = 0.0003750324249267578
[      8      12      20 ... 5110670 5110679 5110685]
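
For reference, ElementwiseKernel takes a C-style argument list and an operation written as a statement over the index i, rather than a dtype and a binary expression. The following is a minimal sketch of an element-wise addition z[i] = x[i] + y[i], assuming int32 inputs. The function names and the kernel name add_int32 are illustrative, and the GPU function needs a CUDA device; it is not what produced the output above.

```python
def elementwise_add_reference(x, y):
    """CPU reference: z[i] = x[i] + y[i] for every index i."""
    return [a + b for a, b in zip(x, y)]


def elementwise_add_gpu(x_host, y_host):
    """Same operation with ElementwiseKernel (assumes PyCUDA and a CUDA device)."""
    # Imported lazily so the reference function works without a GPU.
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (sets up a CUDA context)
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel

    add = ElementwiseKernel(
        "int *x, int *y, int *z",  # C-style argument declarations
        "z[i] = x[i] + y[i]",      # statement executed for each index i
        "add_int32",               # kernel name (illustrative)
    )
    x = gpuarray.to_gpu(np.asarray(x_host, dtype=np.int32))
    y = gpuarray.to_gpu(np.asarray(y_host, dtype=np.int32))
    z = gpuarray.empty_like(x)
    add(x, y, z)
    return z.get()


print(elementwise_add_reference([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```

Unlike a scan, each output element here depends only on the inputs at the same index, so the two kernels compute different things and are not interchangeable.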

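The timings in the InclusiveScanKernel and ElementwiseKernel sections are almost identical. Note, however, that the loop in the ElementwiseKernel section calls krn (the InclusiveScanKernel) rather than knr, and the printed arrays are running totals, so both sets of timings measure the scan kernel. They do not show how fast ElementwiseKernel is.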
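
Because kernel launches can return to the host before the GPU has finished, a fairer measurement waits for the device on both sides of the timed call. The helper below is a sketch of that idea. timed and its sync parameter are hypothetical names, and with PyCUDA one would presumably pass pycuda.driver.Context.synchronize as sync.

```python
import time


def timed(fn, *args, sync=None, repeats=1):
    """Return (best_seconds, last_result) for fn(*args).

    sync, if given, is called before starting and before stopping the
    clock so that pending asynchronous work falls inside the interval.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain earlier work before starting the clock
        start = time.perf_counter()
        result = fn(*args)
        if sync is not None:
            sync()  # wait for the work launched by fn to finish
        best = min(best, time.perf_counter() - start)
    return best, result


# With PyCUDA this could look like the following (assumption, needs a CUDA device):
#   import pycuda.driver as cuda
#   seconds, out = timed(krn, arrayDList[0], sync=cuda.Context.synchronize)
```

Since the scan is applied in place, repeats greater than 1 would rescan the previous result; that changes the values but not the amount of work being timed.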