Dot-product speed comparison: cupy, pycuda, skcuda, and numpy

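The name cupy makes it sound like Kewpie (QP) mayonnaise, but it is apparently pronounced "koo-pie". Still, in a country with as many mayonnaise lovers as Japan, it is tempting to call it Kewpie anyway. This time I want to add cupy to the dot-product speed comparison. Incidentally, cupy appears to be an open-source library developed by Preferred Networks, the world-class AI venture famous for Chainer, which this blog introduced two years and three months ago. The field is currently a two-way choice between TensorFlow+Keras and PyTorch+Caffe2+onnx, so there seems to be little room for Chainer, but the possibility that Chainer becomes mainstream two years from now cannot be ruled out. It all depends on how hard the engineers work.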

Adding cupy to the speed comparison

import pycuda.autoinit
import numpy as np
import pycuda.gpuarray as gpuarray
import time
import cupy
from pycuda.compiler import SourceModule
import skcuda.linalg as culinalg
import skcuda
culinalg.init()

mod = SourceModule("""    
__global__ void dot(float* C, float* A, float* B, int width) {
__shared__ float Ashare[16][16];
__shared__ float Bshare[16][16];
int bx = blockIdx.x, by = blockIdx.y;
int tx = threadIdx.x, ty = threadIdx.y;
int row = by * 16 + ty;
int col = bx * 16 + tx;
float result = 0;
for (int m = 0; m < width / 16; m++) {
//collectively load the A and B tiles into shared memory
Ashare[ty][tx] = A[(row * width) + (m * 16) + tx];
Bshare[ty][tx] = B[(((m * 16) + ty) * width) + col];
__syncthreads(); //wait for all the shared memory to be loaded

for (int k = 0; k < 16; k++) {
result += Ashare[ty][k] * Bshare[k][tx];
}
__syncthreads();
}
    C[(row * width) + col] = result;
}
""")
knl = mod.get_function("dot")
warp_size=32

results = []
# Square matrices of width 256 to 8192, plus 13500 (which is not a multiple of 16).
for mat_width in [2**8,2**9,2**10,2**11,2**12,2**13,13500]:
    nb_columnsA=mat_width
    nb_linesA=mat_width
    nb_columnsB=mat_width
    nb_linesB=nb_columnsA
    a=np.random.rand(nb_linesA,nb_columnsA).astype(np.float32)
    b=np.random.rand(nb_linesB,nb_columnsB).astype(np.float32)
    
    start = time.time()
    numpy_dot = np.dot(a,b)
    end = time.time()
    sec1 = end - start
    print ("CPU time", sec1)
    #print ("numpy_dot", numpy_dot)

    start = time.time()
    a_gpu = gpuarray.to_gpu(np.array(a,dtype = np.float32))
    b_gpu = gpuarray.to_gpu(np.array(b,dtype = np.float32))
    c_gpu = gpuarray.empty((nb_linesA, nb_columnsB), np.float32)
    threadPerBlockx=16
    threadPerBlocky=16
    size_Cx = nb_columnsB
    size_Cy = nb_linesA
    # Integer division: a width that is not a multiple of 16 leaves the last partial tile uncomputed.
    BlockPerGridx = size_Cx // threadPerBlockx
    BlockPerGridy = size_Cy // threadPerBlocky
    knl(c_gpu, a_gpu, b_gpu, np.int32(mat_width),\
       block = (threadPerBlockx, threadPerBlocky, 1), grid=(BlockPerGridx,BlockPerGridy))
    pycuda_dot = c_gpu.get()
    end = time.time()
    sec2 = end - start
    print ("GPU time", sec2)
    print('Speed difference of numpy and pycuda: {:0.1f}x'.format(sec1 / sec2))
    print (np.allclose(pycuda_dot,numpy_dot))
    
    start = time.time()
    c_gpu = gpuarray.to_gpu(np.array(a,dtype = np.float32))
    d_gpu = gpuarray.to_gpu(np.array(b,dtype = np.float32))
    ressk = culinalg.dot(c_gpu, d_gpu)
    ressk = ressk.get()
    end = time.time()
    sec3 = end - start    
    print ("SKC time", sec3)
    print('Speed difference of pycuda and skcuda: {:0.1f}x'.format(sec2 / sec3))
    print (np.allclose(numpy_dot,ressk))
    
    start = time.time()
    c = cupy.asarray(a)
    d = cupy.asarray(b)
    cupy_dot = cupy.dot(c, d)
    cupy_dot2 = cupy_dot.get()
    end = time.time()
    sec4 = end - start    
    print ("cupy time", sec4)
    print('Speed difference of cupy and skcuda: {:0.1f}x'.format(sec3 / sec4))
    print (np.allclose(numpy_dot,cupy_dot2))
    results.append([mat_width,sec1,sec2,sec3,sec4])

Output:

CPU time 0.004679679870605469
GPU time 0.010504007339477539
Speed difference of numpy and pycuda: 0.4x
True
SKC time 0.004342317581176758
Speed difference of pycuda and skcuda: 2.4x
True
cupy time 0.0065364837646484375
Speed difference of cupy and skcuda: 0.7x
True
CPU time 0.0016524791717529297
GPU time 0.0027039051055908203
Speed difference of numpy and pycuda: 0.6x
True
SKC time 0.001592874526977539
Speed difference of pycuda and skcuda: 1.7x
True
cupy time 0.0015077590942382812
Speed difference of cupy and skcuda: 1.1x
True
CPU time 0.0065996646881103516
GPU time 0.009189367294311523
Speed difference of numpy and pycuda: 0.7x
True
SKC time 0.004293203353881836
Speed difference of pycuda and skcuda: 2.1x
True
cupy time 0.004175424575805664
Speed difference of cupy and skcuda: 1.0x
True
CPU time 0.07387423515319824
GPU time 0.05774497985839844
Speed difference of numpy and pycuda: 1.3x
True
SKC time 0.02467632293701172
Speed difference of pycuda and skcuda: 2.3x
True
cupy time 0.02106499671936035
Speed difference of cupy and skcuda: 1.2x
True
CPU time 0.5532965660095215
GPU time 0.49304819107055664
Speed difference of numpy and pycuda: 1.1x
True
SKC time 0.08375120162963867
Speed difference of pycuda and skcuda: 5.9x
True
cupy time 0.09917664527893066
Speed difference of cupy and skcuda: 0.8x
True
CPU time 6.68488621711731
GPU time 3.4541449546813965
Speed difference of numpy and pycuda: 1.9x
True
SKC time 0.7705996036529541
Speed difference of pycuda and skcuda: 4.5x
True
cupy time 0.6721401214599609
Speed difference of cupy and skcuda: 1.1x
True
CPU time 26.387485027313232
GPU time 13.792988777160645
Speed difference of numpy and pycuda: 1.9x
False
SKC time 2.4380271434783936
Speed difference of pycuda and skcuda: 5.7x
True
cupy time 2.4123611450195312
Speed difference of cupy and skcuda: 1.0x
True
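
The single False in the log belongs to the pycuda kernel at mat_width = 13500. A likely cause is that 13500 is not a multiple of the 16x16 tile size: both the grid size and the kernel's inner loop divide by 16 with integer division, so the last partial tile is never processed. One possible workaround is to zero-pad the inputs up to the next multiple of 16 and crop the result afterwards. The sketch below illustrates the idea with NumPy only. The helper name pad_to_multiple is hypothetical and not part of the original script, and a small width of 100 stands in for 13500.

```python
import numpy as np

TILE = 16  # tile width used by the CUDA kernel in the script above


def pad_to_multiple(mat, tile=TILE):
    """Zero-pad a 2-D array so both dimensions are multiples of `tile`."""
    rows, cols = mat.shape
    pad_rows = (-rows) % tile  # 0 when rows is already a multiple
    pad_cols = (-cols) % tile
    return np.pad(mat, ((0, pad_rows), (0, pad_cols)), mode="constant")


rng = np.random.default_rng(0)
n = 100  # small stand-in for 13500; also not a multiple of 16
a = rng.random((n, n), dtype=np.float32)
b = rng.random((n, n), dtype=np.float32)

a_pad = pad_to_multiple(a)
b_pad = pad_to_multiple(b)

# In the real script the padded product would run on the GPU;
# NumPy stands in for the kernel here.
c_pad = a_pad @ b_pad
c = c_pad[:n, :n]  # crop back to the original size

print(a_pad.shape)
print(np.allclose(c, a @ b, rtol=1e-4))
```

The extra zeros contribute nothing to the sums, so the cropped block should match the unpadded product up to floating-point rounding. With the kernel above, the padded width would also be the value passed as the width argument.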

Plotting the results

import matplotlib.pyplot as plt
import numpy as np

results = np.array(results)
# Set figure defaults before plotting so they apply to this figure.
plt.rcParams['figure.figsize'] = 18, 10
plt.rcParams['font.size'] = 20
widths = results[:, 0]
times = results[:, 1:5]  # columns: numpy, pycuda, scikit-cuda, cupy
plt.semilogx(widths, times, 'o-')
plt.ylabel('Seconds')
plt.xlabel('Value of mat_width')
plt.legend(['CPU', 'GPU', 'SK-CUDA', 'cupy'])
plt.show()
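
The log prints pairwise ratios only, so for a quick overview the timings can also be expressed as speedups relative to NumPy. The sketch below hard-codes the timings from the log above, rounded to three significant digits, so that it runs on its own. In the actual script the results array would be used instead. The variable names are illustrative.

```python
import numpy as np

# Columns: mat_width, numpy, pycuda kernel, scikit-cuda, cupy (seconds).
# Values are copied from the printed log and rounded.
timings = np.array([
    [256,   0.00468, 0.0105,  0.00434, 0.00654],
    [512,   0.00165, 0.00270, 0.00159, 0.00151],
    [1024,  0.00660, 0.00919, 0.00429, 0.00418],
    [2048,  0.0739,  0.0577,  0.0247,  0.0211],
    [4096,  0.553,   0.493,   0.0838,  0.0992],
    [8192,  6.68,    3.45,    0.771,   0.672],
    [13500, 26.4,    13.8,    2.44,    2.41],
])

widths = timings[:, 0].astype(int).tolist()
cpu = timings[:, 1:2]           # kept 2-D so it broadcasts across each row
speedup = cpu / timings[:, 2:]  # values above 1 mean faster than NumPy

for w, (s_pycuda, s_skcuda, s_cupy) in zip(widths, speedup):
    print(f"{w:>6}: pycuda {s_pycuda:4.1f}x  skcuda {s_skcuda:4.1f}x  cupy {s_cupy:4.1f}x")
```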

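scikit-cuda and cupy were almost identical in speed, but cupy won hands down on ease of use, so cupy looks like the best choice for dot-product computation. Just what you would expect from Preferred Networks. I hope Chainer, Japan's homegrown machine learning framework, keeps pushing forward as well.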
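
One caveat about the numbers behind this conclusion: each case is timed only once, and the GPU timings include the host-to-device and device-to-host copies. In the log, NumPy is also slower for the 256 case than for the 512 case, which suggests some one-time warm-up cost. A minimal sketch of a more robust timer, with an untimed warm-up call and a best-of-N measurement, could look like this. The helper best_of is hypothetical. For GPU code, the timed function would need to end with a synchronizing call such as .get(), as the script above already does.

```python
import time
import numpy as np


def best_of(fn, repeats=5, warmup=1):
    """Return the shortest wall-clock time of `fn()` over `repeats` runs."""
    for _ in range(warmup):
        fn()  # not timed: absorbs one-time setup costs
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


rng = np.random.default_rng(0)
a = rng.random((256, 256), dtype=np.float32)
b = rng.random((256, 256), dtype=np.float32)

elapsed = best_of(lambda: np.dot(a, b))
print(f"numpy 256x256: {elapsed:.6f} s")
```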


Reference site: https://devtalk.nvidia.com/