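Before I knew it, PyCUDA had gained Python 3.7 support. I was surprised at how quickly it happened. When someone complained that it did not work on Python 3.7, the discussion turned to some problem with Boost, and the reply was that switching to pybind11 might fix it. At that point I assumed we would be waiting until next year, but the issue was resolved without any fuss.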

!pip3 install git+https://github.com/inducer/pycuda.git

Collecting git+https://github.com/inducer/pycuda.git
  Cloning https://github.com/inducer/pycuda.git to /tmp/pip-req-build-f1hyehgb
Collecting pytools>=2011.2 (from pycuda==2018.1.1)
  Downloading https://files.pythonhosted.org/packages/90/6a/7b706e4730db0ee5724c677cceafcac1bc9710c61612442a689e7b0aa5c4/pytools-2018.5.2.tar.gz (58kB)
    100% |████████████████████████████████| 61kB 4.3MB/s ta 0:00:01
Collecting pytest>=2 (from pycuda==2018.1.1)
  Downloading https://files.pythonhosted.org/packages/0c/9a/592314ceda78f3307afb6cf56d7fdbb92c5a5960a88a6d2fd25c11312ead/pytest-3.8.1-py2.py3-none-any.whl (209kB)
    100% |████████████████████████████████| 215kB 9.7MB/s eta 0:00:01
Requirement already satisfied: decorator>=3.2.0 in /root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages (from pycuda==2018.1.1) (4.3.0)
Collecting appdirs>=1.4.0 (from pycuda==2018.1.1)
  Downloading https://files.pythonhosted.org/packages/56/eb/810e700ed1349edde4cbdc1b2a21e28cdf115f9faf263f6bbf8447c1abf3/appdirs-1.4.3-py2.py3-none-any.whl
Collecting mako (from pycuda==2018.1.1)
  Downloading https://files.pythonhosted.org/packages/eb/f3/67579bb486517c0d49547f9697e36582cd19dafb5df9e687ed8e22de57fa/Mako-1.0.7.tar.gz (564kB)
    100% |████████████████████████████████| 573kB 14.3MB/s ta 0:00:01
Requirement already satisfied: six>=1.8.0 in /root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages (from pytools>=2011.2->pycuda==2018.1.1) (1.11.0)
Requirement already satisfied: numpy>=1.6.0 in /root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages (from pytools>=2011.2->pycuda==2018.1.1) (1.15.2)
Collecting py>=1.5.0 (from pytest>=2->pycuda==2018.1.1)
  Downloading https://files.pythonhosted.org/packages/c8/47/d179b80ab1dc1bfd46a0c87e391be47e6c7ef5831a9c138c5c49d1756288/py-1.6.0-py2.py3-none-any.whl (83kB)
    100% |████████████████████████████████| 92kB 14.5MB/s ta 0:00:01
Requirement already satisfied: setuptools in /root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages (from pytest>=2->pycuda==2018.1.1) (39.0.1)
Collecting more-itertools>=4.0.0 (from pytest>=2->pycuda==2018.1.1)
  Downloading https://files.pythonhosted.org/packages/79/b1/eace304ef66bd7d3d8b2f78cc374b73ca03bc53664d78151e9df3b3996cc/more_itertools-4.3.0-py3-none-any.whl (48kB)
    100% |████████████████████████████████| 51kB 12.7MB/s ta 0:00:01
Collecting atomicwrites>=1.0 (from pytest>=2->pycuda==2018.1.1)
  Downloading https://files.pythonhosted.org/packages/3a/9a/9d878f8d885706e2530402de6417141129a943802c084238914fa6798d97/atomicwrites-1.2.1-py2.py3-none-any.whl
Collecting attrs>=17.4.0 (from pytest>=2->pycuda==2018.1.1)
  Downloading https://files.pythonhosted.org/packages/3a/e1/5f9023cc983f1a628a8c2fd051ad19e76ff7b142a0faf329336f9a62a514/attrs-18.2.0-py2.py3-none-any.whl
Collecting pluggy>=0.7 (from pytest>=2->pycuda==2018.1.1)
  Downloading https://files.pythonhosted.org/packages/f5/f1/5a93c118663896d83f7bcbfb7f657ce1d0c0d617e6b4a443a53abcc658ca/pluggy-0.7.1-py2.py3-none-any.whl
Requirement already satisfied: MarkupSafe>=0.9.2 in /root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages (from mako->pycuda==2018.1.1) (1.0)
Installing collected packages: appdirs, pytools, py, more-itertools, atomicwrites, attrs, pluggy, pytest, mako, pycuda
  Running setup.py install for pytools ... done
  Running setup.py install for mako ... done
  Running setup.py install for pycuda ... done
Successfully installed appdirs-1.4.3 atomicwrites-1.2.1 attrs-18.2.0 mako-1.0.7 more-itertools-4.3.0 pluggy-0.7.1 py-1.6.0 pycuda-2018.1.1 pytest-3.8.1 pytools-2018.5.2
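
To confirm from Python which versions actually ended up in the environment, something like the following could be used. This is only a sketch: `installed_version` is a hypothetical helper name of my own, and it assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library (the environment in this post is 3.7.0).

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the packages touched by the install above; missing ones print None.
for name in ("pycuda", "pytools", "numpy"):
    print(name, installed_version(name))
```

A package that is not installed simply yields `None`, so the loop is safe to run on a machine without PyCUDA.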

As a quick test, I ran some arbitrary code.

import time
import numpy as np
from pycuda import driver, compiler, gpuarray
# Importing pycuda.autoinit initializes the device and creates a context
import pycuda.autoinit
# -----------------------------------------------------
# CUDA parameters
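kernel_code_template = """
    // C = A * B for row-major matrices; one thread computes one element of C.
    __global__ void MatProd(float* C, float* A, float* B, int dimAx, int dimBx, int dimCx, int dimCy)
    {
        int row = blockDim.y * blockIdx.y + threadIdx.y;
        int col = blockDim.x * blockIdx.x + threadIdx.x;

        // Accumulate in double precision, then store as float.
        double Result = 0;

        // Threads that fall outside the bounds of C do nothing.
        if (row <= dimCy - 1 && col <= dimCx - 1)
        {
            for (int k = 0; k < dimAx; k++)
            {
                Result += A[k + dimAx * row] * B[col + dimBx * k];
            }

            C[col + row * dimCx] = Result;
        }
    }
    """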
# get the kernel code from the template
kernel_code=kernel_code_template
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# get the kernel function from the compiled module
MatProd = mod.get_function("MatProd")
warp_size=32 # Warp size on the GPU.
    # --------------------------------------------------------------------
    # --------------------BEGIN of INITIALISATION-------------------------
    # --------------------------------------------------------------------

    # Build the host matrices for the computation C = A * B.
    # They are treated as given inputs, so the time spent creating them
    # is not included in any measurement.

nb_columnsA=1024
nb_linesA=1024

nb_columnsB=1024
nb_linesB=nb_columnsA

a_cpu=np.random.rand(nb_linesA,nb_columnsA).astype(np.float32)
b_cpu=np.random.rand(nb_linesB,nb_columnsB).astype(np.float32)

    # --------------------------------------------------------------------
    # --------------------End of INITIALISATION---------------------------
    # --------------------------------------------------------------------

    # --------------------------------------------------------------------
    # --------------------CUDA PART---------------------------------------
    # --------------------------------------------------------------------
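    # Send the data to the GPU.
    # Note: time.clock() is deprecated and is removed in Python 3.8;
    # time.perf_counter() is the replacement there.
total_CUDA_time_Begin=time.clock()
time_memory_alloc_GPU_Begin=time.clock()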
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu=gpuarray.to_gpu(b_cpu)
    # We allocate the memory on the GPU for the result C=A*B
c_gpu = gpuarray.empty((nb_linesA, nb_columnsB), np.float32)
time_memory_alloc_GPU_End=time.clock()
    # ----------------------------------------------------------
    # Start of the CUDA computation:
    # choose the number of threads per block along each axis.
threadPerBlockx=warp_size
threadPerBlocky=warp_size

    # Choose the number of blocks per grid so that the grid covers all of C
    # (ceiling division of the matrix size by the block size).
size_Cx = nb_columnsB
size_Cy = nb_linesA
BlockPerGridx = 1 + (size_Cx - 1) // threadPerBlockx
BlockPerGridy = 1 + (size_Cy - 1) // threadPerBlocky

time_computation_CUDA_Begin=time.clock()
MatProd(
        # output
        c_gpu,
        # inputs
        a_gpu, b_gpu,
        np.int32(nb_columnsA),np.int32(nb_columnsB),np.int32(nb_columnsB),np.int32(nb_linesA),
        # launch configuration: threadPerBlockx x threadPerBlocky threads per block,
        # with enough blocks in the grid to cover every element of C
        block = (threadPerBlockx, threadPerBlocky, 1), grid=(BlockPerGridx,BlockPerGridy)
        )
driver.Context.synchronize()
time_computation_CUDA_End=time.clock()

time_memory_get_result_GPU_Begin=time.clock()
c_gpu_result=c_gpu.get() # We get the result
time_memory_get_result_GPU_End=time.clock()

total_CUDA_time_End=time.clock()
    # --------------------------------------------------------------------
    # --------------------END OF CUDA PART--------------------------------
    # --------------------------------------------------------------------

    # --------------------------------------------------------------------
    # --------------------PYTHON PART-------------------------------------
    # --------------------------------------------------------------------

    # Compute the same product on the CPU with NumPy:
total_Python_time_Begin=time.clock()
c_cpu=np.empty([nb_linesA,nb_columnsB]).astype(np.float32)
time_computation_Python_Begin=time.clock()
c_cpu=np.dot(a_cpu,b_cpu)
time_computation_Python_End=time.clock()
total_Python_time_End=time.clock()

    # --------------------------------------------------------------------
    # --------------------END OF PYTHON PART------------------------------
    # --------------------------------------------------------------------

    #------------------------------------------------------------
    # We display the execution times :
    # Computation times :
time_computation_CUDA=time_computation_CUDA_End-time_computation_CUDA_Begin
time_computation_Python=time_computation_Python_End-time_computation_Python_Begin
print("CUDA pure computation time : ", time_computation_CUDA)
print("Python pure computation time : ", time_computation_Python)
print(" ")
    # Memory allocation times :
time_memory_alloc_GPU=time_memory_alloc_GPU_End-time_memory_alloc_GPU_Begin
time_memory_get_result_GPU=time_memory_get_result_GPU_End-time_memory_get_result_GPU_Begin
print("CUDA memory allocation time (allocating C, transferring A,B from CPU to GPU):", time_memory_alloc_GPU)
print("CUDA getting result from GPU (Pulling back C from GPU to CPU after computation) :", time_memory_get_result_GPU)
    # Total time (computation + memory allocation)
print(" ")
total_CUDA_time=total_CUDA_time_End-total_CUDA_time_Begin
total_Python_time=total_Python_time_End-total_Python_time_Begin
print("CUDA total time (alloc C + A to gpu + B to gpu + comput + get result) :", total_CUDA_time)
print("Python total time (comput + alloc C) :", total_Python_time)

CUDA pure computation time :  0.040031000000000816
Python pure computation time :  0.10684999999999967
 
CUDA memory allocation time (allocating C, transferring A,B from CPU to GPU): 0.0025399999999997647
CUDA getting result from GPU (Pulling back C from GPU to CPU after computation) : 0.0009440000000005
 
CUDA total time (alloc C + A to gpu + B to gpu + comput + get result) : 0.044026000000000565
Python total time (comput + alloc C) : 0.11297000000000068
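
The run above only prints timings and never checks that the GPU result is right. As a rough sanity check of the launch geometry and of the flat row-major indexing used in `MatProd`, the kernel can be emulated on the CPU. The sketch below is my own illustration, not part of the original run: `blocks_needed` and `matprod_reference` are hypothetical names, and the small matrix sizes are chosen only to keep the pure-Python loop fast.

```python
import numpy as np


def blocks_needed(size, threads_per_block):
    """Ceiling division, written the same way as in the launch code above."""
    return 1 + (size - 1) // threads_per_block


def matprod_reference(a, b, block=(32, 32)):
    """CPU emulation of the MatProd kernel's indexing on flat row-major buffers."""
    dim_cy, dim_ax = a.shape
    dim_bx = b.shape[1]
    dim_cx = dim_bx
    a_flat, b_flat = a.ravel(), b.ravel()
    c_flat = np.zeros(dim_cy * dim_cx, dtype=np.float32)
    grid_x = blocks_needed(dim_cx, block[0])
    grid_y = blocks_needed(dim_cy, block[1])
    k = np.arange(dim_ax)
    # Visit every thread the launch would create, including those past the edge.
    for row in range(grid_y * block[1]):
        for col in range(grid_x * block[0]):
            if row <= dim_cy - 1 and col <= dim_cx - 1:  # same guard as the kernel
                products = a_flat[k + dim_ax * row] * b_flat[col + dim_bx * k]
                # Sum in double precision, like `double Result` in the kernel.
                c_flat[col + row * dim_cx] = products.sum(dtype=np.float64)
    return c_flat.reshape(dim_cy, dim_cx)


rng = np.random.default_rng(0)
a = rng.random((5, 7), dtype=np.float32)
b = rng.random((7, 3), dtype=np.float32)
c = matprod_reference(a, b, block=(4, 4))
assert np.allclose(c, a @ b, rtol=1e-5)
print(blocks_needed(1024, 32))  # 32 blocks per axis for the 1024 x 1024 case
```

On the GPU side, the equivalent check would be something like `np.allclose(c_gpu_result, c_cpu, rtol=1e-4)` after the run, where the tolerance is my guess for float32 data.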

/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:67: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:68: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:73: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:86: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:97: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:99: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:101: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:103: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:113: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:115: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:117: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
/root/.pyenv/versions/3.7.0/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:118: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
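
As the warnings say, `time.clock` is gone from Python 3.8 onward. Also note that on Unix `time.clock` reported processor time rather than elapsed wall-clock time, so the figures above may not reflect what a stopwatch would show. A minimal sketch of the same measurement with `time.perf_counter` might look like this; the `stopwatch` helper and the `timings` dictionary are hypothetical names of my own, and the example times only the NumPy side so that it runs without a GPU.

```python
import time
from contextlib import contextmanager

import numpy as np


@contextmanager
def stopwatch(label, results):
    """Store the elapsed wall-clock seconds for the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
a_cpu = np.random.rand(256, 256).astype(np.float32)
b_cpu = np.random.rand(256, 256).astype(np.float32)
with stopwatch("numpy dot", timings):
    c_cpu = np.dot(a_cpu, b_cpu)
print("Python pure computation time : ", timings["numpy dot"])
```

For the GPU part, the kernel launch and `driver.Context.synchronize()` would go inside the `with` block, since the launch alone returns before the kernel has finished.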

!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
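
If the toolkit release needs to be picked up from a script rather than read by eye, the `nvcc --version` text can be parsed. This is a sketch under my own assumptions: `parse_cuda_release` and `cuda_release` are hypothetical helpers, and the regular expression only relies on the `release X.Y` wording visible in the output above.

```python
import re
import shutil
import subprocess


def parse_cuda_release(nvcc_output):
    """Extract the 'release X.Y' number from `nvcc --version` text, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_release():
    """Return the release reported by nvcc, or None when nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    completed = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_cuda_release(completed.stdout)


sample = "Cuda compilation tools, release 10.0, V10.0.130"
print(parse_cuda_release(sample))  # 10.0
```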

%load_ext version_information
import torch, pymetis, skcuda
%version_information torch, pycuda, pymetis, scikit-cuda

The version_information extension is already loaded. To reload it, use:
  %reload_ext version_information

Software	Version
Python	3.7.0 64bit [GCC 7.3.0]
IPython	6.5.0
OS	Linux 4.15.0 34 generic x86_64 with debian buster sid
torch	1.0.0a0+2cdf98a
pycuda	2018.1.1
pymetis	2018.1
scikit-cuda	0.5.2
Wed Sep 26 22:06:30 2018 JST