pandas dataframeのobject型をfloat型に変換する

xlsファイルやcsvファイルを読み込むと、読み込んだ数値データ型が何故だかobjectになっている時が多々ある。objectのままだと数値データを使ってグラフを作成できないので、データ型をobjectからintやfloatに変換する必要がある。

スポンサーリンク

数値データをobjectからfloatに変換する

このサイトから引っ張ってきたデータを実験的に使用する。

import pandas as pd

a=list(range(31,54))
a.insert(0, 3)
df = pd.read_excel("myexcel.xls",sep='\t',thousands=',',usecols=a,na_values=['n/a', '--'])
df.head()
Country 2002 2003 2004 2005 2006 2007 2008 2009 2010 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
0 Afghanistan 182.842 198.736 200.069 223.737 235.731 267.177 277.498 334.621 362.857 482.629 493.073 506.215 517.858 533.394 552.063 574.127 599.933 629.88 664.452
1 Afghanistan NaN 8.692 0.671 11.83 5.361 13.34 3.863 20.585 8.438 0.988 2.164 2.665 2.3 3 3.5 3.997 4.495 4.992 5.489
2 Afghanistan 178.756 220.013 246.21 304.926 345.817 427.495 517.509 607.227 711.759 “1,226.57” “1,318.48” “1,377.54” “1,418.13” “1,488.86” “1,595.05” “1,733.54” “1,902.11” “2,096.95” “2,322.65”
3 Afghanistan 4.367 4.553 5.146 6.167 6.925 8.556 10.297 12.066 15.325 20.057 19.428 20.235 19.585 19.99 20.682 21.928 23.577 25.45 27.608
4 Afghanistan 18.835 20.852 21.557 24.858 26.983 31.404 33.252 40.403 44.323 64.276 66.384 69.449 72.648 76.158 80.47 85.426 91.113 97.643 105.158

5 rows × 24 columns

df.columns = list(map(str, df.columns))
df.set_index('Country', inplace=True)
df.dtypes
2002    object
2003    object
2004    object
2005    object
2006    object
2007    object
2008    object
2009    object
2010    object
2011    object
2012    object
2013    object
2014    object
2015    object
2016    object
2017    object
2018    object
2019    object
2020    object
2021    object
2022    object
2023    object
2024    object
dtype: object

数値データが何故だかobjectになっているので、これを以下のようにしてfloat型に変えてやる必要がある。

df.apply(pd.to_numeric)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string ""8,235.10""

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-5-6def46ca8df2> in <module>
----> 1 df.apply(pd.to_numeric)

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6485                          args=args,
   6486                          kwds=kwds)
-> 6487         return op.get_result()
   6488 
   6489     def applymap(self, func):

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
    149             return self.apply_raw()
    150 
--> 151         return self.apply_standard()
    152 
    153     def apply_empty_result(self):

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
    255 
    256         # compute the result using the series generator
--> 257         self.apply_series_generator()
    258 
    259         # wrap results

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
    284             try:
    285                 for i, v in enumerate(series_gen):
--> 286                     results[i] = self.f(v)
    287                     keys.append(v.name)
    288             except Exception as e:

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    133             coerce_numeric = False if errors in ('ignore', 'raise') else True
    134             values = lib.maybe_convert_numeric(values, set(),
--> 135                                                coerce_numeric=coerce_numeric)
    136 
    137     except Exception:

pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: ('Unable to parse string ""8,235.10"" at position 6', 'occurred at index 2002')

恐らく、quotation mark””が邪魔をしているんだと思われるので、こいつを取り除いていやる。

for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')
df.head()
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
Country
Afghanistan 182.842 198.736 200.069 223.737 235.731 267.177 277.498 334.621 362.857 386.368 482.629 493.073 506.215 517.858 533.394 552.063 574.127 599.933 629.88 664.452
Afghanistan NaN 8.692 0.671 11.83 5.361 13.34 3.863 20.585 8.438 6.479 0.988 2.164 2.665 2.3 3 3.5 3.997 4.495 4.992 5.489
Afghanistan 178.756 220.013 246.21 304.926 345.817 427.495 517.509 607.227 711.759 836.222 1,226.57 1,318.48 1,377.54 1,418.13 1,488.86 1,595.05 1,733.54 1,902.11 2,096.95 2,322.65
Afghanistan 4.367 4.553 5.146 6.167 6.925 8.556 10.297 12.066 15.325 17.89 20.057 19.428 20.235 19.585 19.99 20.682 21.928 23.577 25.45 27.608
Afghanistan 18.835 20.852 21.557 24.858 26.983 31.404 33.252 40.403 44.323 48.18 64.276 66.384 69.449 72.648 76.158 80.47 85.426 91.113 97.643 105.158

5 rows × 23 columns

df.apply(pd.to_numeric)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "8,235.10"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-7-6def46ca8df2> in <module>
----> 1 df.apply(pd.to_numeric)

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6485                          args=args,
   6486                          kwds=kwds)
-> 6487         return op.get_result()
   6488 
   6489     def applymap(self, func):

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
    149             return self.apply_raw()
    150 
--> 151         return self.apply_standard()
    152 
    153     def apply_empty_result(self):

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
    255 
    256         # compute the result using the series generator
--> 257         self.apply_series_generator()
    258 
    259         # wrap results

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
    284             try:
    285                 for i, v in enumerate(series_gen):
--> 286                     results[i] = self.f(v)
    287                     keys.append(v.name)
    288             except Exception as e:

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    133             coerce_numeric = False if errors in ('ignore', 'raise') else True
    134             values = lib.maybe_convert_numeric(values, set(),
--> 135                                                coerce_numeric=coerce_numeric)
    136 
    137     except Exception:

pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: ('Unable to parse string "8,235.10" at position 6', 'occurred at index 2002')

たぶん、comma,が邪魔をしていると思われるのでこいつを取り除いてやる。

for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace(',', '')
df.head()
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
Country
Afghanistan 182.842 198.736 200.069 223.737 235.731 267.177 277.498 334.621 362.857 386.368 482.629 493.073 506.215 517.858 533.394 552.063 574.127 599.933 629.88 664.452
Afghanistan NaN 8.692 0.671 11.83 5.361 13.34 3.863 20.585 8.438 6.479 0.988 2.164 2.665 2.3 3 3.5 3.997 4.495 4.992 5.489
Afghanistan 178.756 220.013 246.21 304.926 345.817 427.495 517.509 607.227 711.759 836.222 1226.57 1318.48 1377.54 1418.13 1488.86 1595.05 1733.54 1902.11 2096.95 2322.65
Afghanistan 4.367 4.553 5.146 6.167 6.925 8.556 10.297 12.066 15.325 17.89 20.057 19.428 20.235 19.585 19.99 20.682 21.928 23.577 25.45 27.608
Afghanistan 18.835 20.852 21.557 24.858 26.983 31.404 33.252 40.403 44.323 48.18 64.276 66.384 69.449 72.648 76.158 80.47 85.426 91.113 97.643 105.158

5 rows × 23 columns

df = df.astype("float")
df.dtypes
2002    float64
2003    float64
2004    float64
2005    float64
2006    float64
2007    float64
2008    float64
2009    float64
2010    float64
2011    float64
2012    float64
2013    float64
2014    float64
2015    float64
2016    float64
2017    float64
2018    float64
2019    float64
2020    float64
2021    float64
2022    float64
2023    float64
2024    float64
dtype: object

日本のGDPデータをプロットする

試しに日本のGDPデータをプロットしてみる。

df = df.loc['Japan']
df.head()
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
Country
Japan 464134.700 471227.700 481616.900 489624.400 496577.00 504791.400 499271.400 472228.800 492023.400 491455.400 516932.500 520081.100 530111.900 534409.850 539630.510 542060.88 544974.610 547739.420 550621.780 553488.850
Japan 0.118 1.528 2.205 1.663 1.42 1.654 -1.094 -5.416 4.192 -0.115 1.223 0.609 1.929 0.811 0.977 0.45 0.538 0.507 0.526 0.521
Japan 515986.200 515400.700 520965.500 524132.900 526879.70 531688.200 520715.700 489501.000 500354.000 491408.400 531319.800 535986.500 545122.000 549016.200 557821.980 567275.01 572865.670 578293.760 584636.480 591508.090
Japan 4115.120 4445.660 4815.170 4755.410 4530.38 4515.260 5037.910 5231.380 5700.100 6157.460 4389.480 4926.670 4859.950 4971.930 5176.210 5495.42 5807.520 6133.860 6475.890 6848.810
Japan 3567.750 3689.540 3872.400 4059.400 4241.64 4427.640 4464.400 4254.780 4484.790 4573.190 5137.290 5225.070 5427.100 5594.450 5749.550 5896.04 6050.970 6207.510 6369.490 6536.650

5 rows × 23 columns

df1 = df[:1]
df1
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
Country
Japan 464134.7 471227.7 481616.9 489624.4 496577.0 504791.4 499271.4 472228.8 492023.4 491455.4 516932.5 520081.1 530111.9 534409.85 539630.51 542060.88 544974.61 547739.42 550621.78 553488.85

1 rows × 23 columns

df1.T.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f98432dc390>
from matplotlib.pyplot import *
from matplotlib.font_manager import FontProperties
from matplotlib import rcParams
style.use('ggplot')

fig, ax = subplots()
df1.T.plot(style='.-',ms=20,figsize=(20,10),fontsize=18,ax=ax)
ax.legend(["GDP"],loc=4, prop={'size': 46});