Simple data processing with NumPy

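I am learning NumPy while following this site. There was a time when I thought a NumPy that cannot use the GPU was not worth caring about, but in practice NumPy is indispensable for GPU programming with PyCUDA and similar tools, so it is a hurdle that cannot be avoided.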

Preparation

First, import the required Python modules.

import os
import sys
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
%precision 4
plt.style.use('ggplot')

Download the required data from the site below.

# download the data locally
if not os.path.exists('populations.txt'):
    ! wget http://scipy-lectures.github.io/_downloads/populations.txt
--2018-10-08 12:35:23--  http://scipy-lectures.github.io/_downloads/populations.txt
Resolving scipy-lectures.github.io (scipy-lectures.github.io)... 185.199.109.153, 185.199.108.153, 185.199.110.153, ...
Connecting to scipy-lectures.github.io (scipy-lectures.github.io)|185.199.109.153|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.scipy-lectures.org/_downloads/populations.txt [following]
--2018-10-08 12:35:23--  http://www.scipy-lectures.org/_downloads/populations.txt
Resolving www.scipy-lectures.org (www.scipy-lectures.org)... 185.199.109.153, 185.199.108.153, 185.199.110.153, ...
Reusing existing connection to scipy-lectures.github.io:80.
HTTP request sent, awaiting response... 200 OK
Length: 525 [text/plain]
Saving to: 'populations.txt'

populations.txt     100%[===================>]     525  --.-KB/s    in 0s      

2018-10-08 12:35:23 (53.1 MB/s) - 'populations.txt' saved [525/525]
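
On a machine without wget, the same "download only if missing" step could be sketched with the standard library alone. The helper name download_if_missing is hypothetical (it is not part of the original notebook), and the URL is simply the redirect target shown in the log above; whether it is still reachable is not checked here.

```python
import os
import urllib.request

# Redirect target reported in the wget log above (assumed to still be valid).
URL = "http://www.scipy-lectures.org/_downloads/populations.txt"


def download_if_missing(url, path):
    """Fetch url into path unless path already exists.

    Hypothetical helper for illustration only.
    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True
```

Calling download_if_missing(URL, 'populations.txt') would then play the role of the wget cell.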

Take a quick look at the contents of the downloaded data.

# peek at the file to see its structure
! head -n 6 populations.txt
# year	hare	lynx	carrot
1900	30e3	4e3	48300
1901	47.2e3	6.1e3	48200
1902	70.2e3	9.8e3	41500
1903	77.4e3	35.2e3	38200
1904	36.3e3	59.4e3	40600

Using NumPy

Load the data into a NumPy array.

# load data into a numpy array
data = np.loadtxt('populations.txt').astype('int')
data[:5, :]
array([[ 1900, 30000,  4000, 48300],
       [ 1901, 47200,  6100, 48200],
       [ 1902, 70200,  9800, 41500],
       [ 1903, 77400, 35200, 38200],
       [ 1904, 36300, 59400, 40600]])
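
To see what np.loadtxt does with this file format, here is a small sketch that parses only the five rows shown by head above from an in-memory string instead of the real file. The names sample_text, raw and sample are made up for this illustration. It suggests that the header line starting with # is skipped as a comment, that values such as 47.2e3 are read as floats, and that astype then converts them to integers.

```python
import io

import numpy as np

# The header and first five rows of populations.txt, tab-separated,
# copied from the `head` output above.
sample_text = (
    "# year\thare\tlynx\tcarrot\n"
    "1900\t30e3\t4e3\t48300\n"
    "1901\t47.2e3\t6.1e3\t48200\n"
    "1902\t70.2e3\t9.8e3\t41500\n"
    "1903\t77.4e3\t35.2e3\t38200\n"
    "1904\t36.3e3\t59.4e3\t40600\n"
)

# loadtxt accepts a file-like object; '#' lines are treated as comments
# and any whitespace (including tabs) separates the columns.
raw = np.loadtxt(io.StringIO(sample_text))
sample = raw.astype(int)

print(raw.dtype)     # floating point, because of values like 47.2e3
print(sample.shape)  # 5 rows, 4 columns
print(sample)
```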

Give the values easy-to-understand names.

# provide convenient named variables
populations = data[:, 1:]
year, hare, lynx, carrot = data.T
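
The unpacking above works because iterating over data.T yields one column of data at a time. A minimal sketch on the first five rows (the array name sample is an assumption for this example) also suggests that these names are views into the original array rather than copies:

```python
import numpy as np

# First five rows of the data set, as printed above.
sample = np.array([
    [1900, 30000,  4000, 48300],
    [1901, 47200,  6100, 48200],
    [1902, 70200,  9800, 41500],
    [1903, 77400, 35200, 38200],
    [1904, 36300, 59400, 40600],
])

populations = sample[:, 1:]          # every column except the year
year, hare, lynx, carrot = sample.T  # one 1-D array per column

print(populations.shape)
print(hare)
# Slices and transposes share memory with the source array,
# so writing into `hare` would also modify `sample`.
print(np.shares_memory(hare, sample))
```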

Mean and standard deviation of each species' population over the years in the data

# The mean and std of the populations of each species for the years in the period
print("Mean (hare, lynx, carrot):", populations.mean(axis=0))
print("Std (hare, lynx, carrot):", populations.std(axis=0))
Mean (hare, lynx, carrot): [34080.9524 20166.6667 42400.    ]
Std (hare, lynx, carrot): [20897.9065 16254.5915  3322.5062]
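
The axis argument decides which dimension is collapsed: axis=0 reduces over the rows (years) and leaves one value per species, while axis=1 leaves one value per year. The sketch below uses only the first five rows, so its numbers differ from the full-data output above. It also shows that std defaults to the population standard deviation (ddof=0); passing ddof=1 gives the sample standard deviation, which is what pandas uses by default.

```python
import numpy as np

# hare, lynx, carrot for 1900-1904 (first five rows of the data set).
populations = np.array([
    [30000,  4000, 48300],
    [47200,  6100, 48200],
    [70200,  9800, 41500],
    [77400, 35200, 38200],
    [36300, 59400, 40600],
])

per_species_mean = populations.mean(axis=0)  # one value per column
per_year_mean = populations.mean(axis=1)     # one value per row

std_population = populations.std(axis=0)      # divides by N (ddof=0)
std_sample = populations.std(axis=0, ddof=1)  # divides by N - 1

print(per_species_mean)
print(per_year_mean.shape)
print(std_population)
print(std_sample)
```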

The year in which each species had its largest population

# Which year each species had the largest population.
print("Year with largest population (hare, lynx, carrot)")
print(year[np.argmax(populations, axis=0)])
Year with largest population (hare, lynx, carrot)
[1903 1904 1900]
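
The lookup above is a two-step process: np.argmax returns, for each column, the row index of the maximum, and indexing year with that integer array translates row indices into years. A sketch on the first five rows only (for this subset the result happens to coincide with the full-data answer):

```python
import numpy as np

year = np.array([1900, 1901, 1902, 1903, 1904])
# hare, lynx, carrot for those years.
populations = np.array([
    [30000,  4000, 48300],
    [47200,  6100, 48200],
    [70200,  9800, 41500],
    [77400, 35200, 38200],
    [36300, 59400, 40600],
])

row_of_max = np.argmax(populations, axis=0)  # one row index per species
peak_years = year[row_of_max]                # integer-array indexing

print(row_of_max)   # [3 4 0]
print(peak_years)   # [1903 1904 1900]
```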

The species with the largest population in each year

# Which species has the largest population for each year.
species = ['hare', 'lynx', 'carrot']
list(zip(year, np.take(species, np.argmax(populations, axis=1))))
[(1900, 'carrot'),
 (1901, 'carrot'),
 (1902, 'hare'),
 (1903, 'hare'),
 (1904, 'lynx'),
 (1905, 'lynx'),
 (1906, 'carrot'),
 (1907, 'carrot'),
 (1908, 'carrot'),
 (1909, 'carrot'),
 (1910, 'carrot'),
 (1911, 'carrot'),
 (1912, 'hare'),
 (1913, 'hare'),
 (1914, 'hare'),
 (1915, 'lynx'),
 (1916, 'carrot'),
 (1917, 'carrot'),
 (1918, 'carrot'),
 (1919, 'carrot'),
 (1920, 'carrot')]
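
Here np.argmax with axis=1 yields a column index for each year, and np.take maps those indices onto the species list. As a sketch on the first five rows, np.take on a plain list behaves like converting the list to an array and indexing it with the integer array:

```python
import numpy as np

species = ['hare', 'lynx', 'carrot']
# hare, lynx, carrot for 1900-1904.
populations = np.array([
    [30000,  4000, 48300],
    [47200,  6100, 48200],
    [70200,  9800, 41500],
    [77400, 35200, 38200],
    [36300, 59400, 40600],
])

winner_idx = np.argmax(populations, axis=1)    # one column index per year
via_take = np.take(species, winner_idx)        # accepts a plain list
via_indexing = np.array(species)[winner_idx]   # equivalent spelling

print(winner_idx)   # [2 2 0 0 1]
print(via_take)
```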

Years in which at least one of the three species had a population above 50,000

# Which years any of the populations is above 50000
print(year[np.any(populations > 50000, axis=1)])
[1902 1903 1904 1912 1913 1914 1915]
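
The comparison populations > 50000 produces a boolean array of the same shape, np.any with axis=1 collapses it to one flag per year, and that flag array is then used as a mask on year. A sketch on the first five rows, with np.all and a per-species count added for contrast (these two extras are not in the original):

```python
import numpy as np

year = np.array([1900, 1901, 1902, 1903, 1904])
# hare, lynx, carrot for those years.
populations = np.array([
    [30000,  4000, 48300],
    [47200,  6100, 48200],
    [70200,  9800, 41500],
    [77400, 35200, 38200],
    [36300, 59400, 40600],
])

above = populations > 50000          # boolean array, shape (5, 3)
any_above = year[above.any(axis=1)]  # at least one species above
all_above = year[above.all(axis=1)]  # every species above
count_per_species = above.sum(axis=0)  # True counts as 1

print(any_above)          # [1902 1903 1904]
print(all_above)          # []
print(count_per_species)  # [2 1 0]
```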

The two years with the lowest population for each species

# The top 2 years for each species when they had the lowest populations.
print(year[np.argsort(populations, axis=0)[:2]])
[[1917 1900 1916]
 [1916 1901 1903]]
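
np.argsort with axis=0 returns, per species, the row indices that would sort that column in ascending order, so the first two rows of the result point at the two smallest populations. The sketch below runs on the first five rows only, so the years differ from the full-data output above; np.take_along_axis is used to show the matching population values.

```python
import numpy as np

year = np.array([1900, 1901, 1902, 1903, 1904])
# hare, lynx, carrot for those years.
populations = np.array([
    [30000,  4000, 48300],
    [47200,  6100, 48200],
    [70200,  9800, 41500],
    [77400, 35200, 38200],
    [36300, 59400, 40600],
])

order = np.argsort(populations, axis=0)  # ascending, column by column
lowest_two_idx = order[:2]               # shape (2, 3)
lowest_two_years = year[lowest_two_idx]
lowest_two_values = np.take_along_axis(populations, lowest_two_idx, axis=0)

print(lowest_two_years)
print(lowest_two_values)
```

Each column of the result belongs to one species, read top to bottom from the lowest to the second-lowest year.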

Plotting with matplotlib

plt.rcParams['figure.figsize'] = 12, 9
plt.rcParams["font.size"] = "18"
plt.plot(year, lynx, 'r-', year, np.gradient(hare), 'b--')
plt.legend(['lynx', 'grad(hare)'], loc='best')
#ax = plt.axis([1900, 1920, -30000, 60000])
plt.xticks(range(1900, 1921, 5))
print(np.corrcoef(lynx, np.gradient(hare)))
[[ 1.     -0.9179]
 [-0.9179  1.    ]]
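
For reference, np.gradient with its default settings estimates the year-to-year rate of change using central differences in the interior and one-sided differences at the two ends, and np.corrcoef returns a symmetric 2x2 matrix whose off-diagonal entry is the correlation between the two inputs. A sketch on the first five years only, so the correlation value is not comparable to the -0.9179 above:

```python
import numpy as np

# hare and lynx for 1900-1904 (first five rows of the data set).
hare = np.array([30000, 47200, 70200, 77400, 36300], dtype=float)
lynx = np.array([4000, 6100, 9800, 35200, 59400], dtype=float)

grad = np.gradient(hare)  # unit spacing: one sample per year

# The same numbers computed by hand.
manual = np.empty_like(hare)
manual[0] = hare[1] - hare[0]               # forward difference
manual[1:-1] = (hare[2:] - hare[:-2]) / 2   # central differences
manual[-1] = hare[-1] - hare[-2]            # backward difference

corr = np.corrcoef(lynx, grad)

print(grad)   # [ 17200.  20100.  15100. -16950. -41100.]
print(corr.shape)
```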