xlsファイルやcsvファイルを読み込むと、読み込んだ数値データ型が何故だかobjectになっている時が多々ある。objectのままだと数値データを使ってグラフを作成できないので、データ型をobjectからintやfloatに変換する必要がある。

数値データをobjectからfloatに変換する¶

このサイトから引っ張ってきたデータを実験的に使用する。

import pandas as pd

a=list(range(31,54))
a.insert(0, 3)
df = pd.read_excel("myexcel.xls",sep='\t',thousands=',',usecols=a,na_values=['n/a', '--'])
df.head()

	Country	2002	2003	2004	2005	2006	2007	2008	2009	2010	…	2015	2016	2017	2018	2019	2020	2021	2022	2023	2024
0	Afghanistan	182.842	198.736	200.069	223.737	235.731	267.177	277.498	334.621	362.857	…	482.629	493.073	506.215	517.858	533.394	552.063	574.127	599.933	629.88	664.452
1	Afghanistan	NaN	8.692	0.671	11.83	5.361	13.34	3.863	20.585	8.438	…	0.988	2.164	2.665	2.3	3	3.5	3.997	4.495	4.992	5.489
2	Afghanistan	178.756	220.013	246.21	304.926	345.817	427.495	517.509	607.227	711.759	…	“1,226.57”	“1,318.48”	“1,377.54”	“1,418.13”	“1,488.86”	“1,595.05”	“1,733.54”	“1,902.11”	“2,096.95”	“2,322.65”
3	Afghanistan	4.367	4.553	5.146	6.167	6.925	8.556	10.297	12.066	15.325	…	20.057	19.428	20.235	19.585	19.99	20.682	21.928	23.577	25.45	27.608
4	Afghanistan	18.835	20.852	21.557	24.858	26.983	31.404	33.252	40.403	44.323	…	64.276	66.384	69.449	72.648	76.158	80.47	85.426	91.113	97.643	105.158

5 rows × 24 columns

df.columns = list(map(str, df.columns))

df.set_index('Country', inplace=True)

df.dtypes

2002    object
2003    object
2004    object
2005    object
2006    object
2007    object
2008    object
2009    object
2010    object
2011    object
2012    object
2013    object
2014    object
2015    object
2016    object
2017    object
2018    object
2019    object
2020    object
2021    object
2022    object
2023    object
2024    object
dtype: object

数値データが何故だかobjectになっているので、これを以下のようにしてfloat型に変えてやる必要がある。

df.apply(pd.to_numeric)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string ""8,235.10""

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-5-6def46ca8df2> in <module>
----> 1 df.apply(pd.to_numeric)

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6485                          args=args,
   6486                          kwds=kwds)
-> 6487         return op.get_result()
   6488 
   6489     def applymap(self, func):

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
    149             return self.apply_raw()
    150 
--> 151         return self.apply_standard()
    152 
    153     def apply_empty_result(self):

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
    255 
    256         # compute the result using the series generator
--> 257         self.apply_series_generator()
    258 
    259         # wrap results

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
    284             try:
    285                 for i, v in enumerate(series_gen):
--> 286                     results[i] = self.f(v)
    287                     keys.append(v.name)
    288             except Exception as e:

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    133             coerce_numeric = False if errors in ('ignore', 'raise') else True
    134             values = lib.maybe_convert_numeric(values, set(),
--> 135                                                coerce_numeric=coerce_numeric)
    136 
    137     except Exception:

pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: ('Unable to parse string ""8,235.10"" at position 6', 'occurred at index 2002')

恐らく、quotation mark””が邪魔をしているんだと思われるので、こいつを取り除いていやる。

for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')
df.head()

	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	…	2015	2016	2017	2018	2019	2020	2021	2022	2023	2024
Country
Afghanistan	182.842	198.736	200.069	223.737	235.731	267.177	277.498	334.621	362.857	386.368	…	482.629	493.073	506.215	517.858	533.394	552.063	574.127	599.933	629.88	664.452
Afghanistan	NaN	8.692	0.671	11.83	5.361	13.34	3.863	20.585	8.438	6.479	…	0.988	2.164	2.665	2.3	3	3.5	3.997	4.495	4.992	5.489
Afghanistan	178.756	220.013	246.21	304.926	345.817	427.495	517.509	607.227	711.759	836.222	…	1,226.57	1,318.48	1,377.54	1,418.13	1,488.86	1,595.05	1,733.54	1,902.11	2,096.95	2,322.65
Afghanistan	4.367	4.553	5.146	6.167	6.925	8.556	10.297	12.066	15.325	17.89	…	20.057	19.428	20.235	19.585	19.99	20.682	21.928	23.577	25.45	27.608
Afghanistan	18.835	20.852	21.557	24.858	26.983	31.404	33.252	40.403	44.323	48.18	…	64.276	66.384	69.449	72.648	76.158	80.47	85.426	91.113	97.643	105.158

5 rows × 23 columns

df.apply(pd.to_numeric)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "8,235.10"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-7-6def46ca8df2> in <module>
----> 1 df.apply(pd.to_numeric)

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6485                          args=args,
   6486                          kwds=kwds)
-> 6487         return op.get_result()
   6488 
   6489     def applymap(self, func):

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
    149             return self.apply_raw()
    150 
--> 151         return self.apply_standard()
    152 
    153     def apply_empty_result(self):

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
    255 
    256         # compute the result using the series generator
--> 257         self.apply_series_generator()
    258 
    259         # wrap results

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
    284             try:
    285                 for i, v in enumerate(series_gen):
--> 286                     results[i] = self.f(v)
    287                     keys.append(v.name)
    288             except Exception as e:

~/.pyenv/versions/miniconda3-latest/envs/py368/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    133             coerce_numeric = False if errors in ('ignore', 'raise') else True
    134             values = lib.maybe_convert_numeric(values, set(),
--> 135                                                coerce_numeric=coerce_numeric)
    136 
    137     except Exception:

pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: ('Unable to parse string "8,235.10" at position 6', 'occurred at index 2002')

たぶん、comma,が邪魔をしていると思われるのでこいつを取り除いてやる。

for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace(',', '')
df.head()

	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	…	2015	2016	2017	2018	2019	2020	2021	2022	2023	2024
Country
Afghanistan	182.842	198.736	200.069	223.737	235.731	267.177	277.498	334.621	362.857	386.368	…	482.629	493.073	506.215	517.858	533.394	552.063	574.127	599.933	629.88	664.452
Afghanistan	NaN	8.692	0.671	11.83	5.361	13.34	3.863	20.585	8.438	6.479	…	0.988	2.164	2.665	2.3	3	3.5	3.997	4.495	4.992	5.489
Afghanistan	178.756	220.013	246.21	304.926	345.817	427.495	517.509	607.227	711.759	836.222	…	1226.57	1318.48	1377.54	1418.13	1488.86	1595.05	1733.54	1902.11	2096.95	2322.65
Afghanistan	4.367	4.553	5.146	6.167	6.925	8.556	10.297	12.066	15.325	17.89	…	20.057	19.428	20.235	19.585	19.99	20.682	21.928	23.577	25.45	27.608
Afghanistan	18.835	20.852	21.557	24.858	26.983	31.404	33.252	40.403	44.323	48.18	…	64.276	66.384	69.449	72.648	76.158	80.47	85.426	91.113	97.643	105.158

5 rows × 23 columns

df = df.astype("float")

df.dtypes

2002    float64
2003    float64
2004    float64
2005    float64
2006    float64
2007    float64
2008    float64
2009    float64
2010    float64
2011    float64
2012    float64
2013    float64
2014    float64
2015    float64
2016    float64
2017    float64
2018    float64
2019    float64
2020    float64
2021    float64
2022    float64
2023    float64
2024    float64
dtype: object

日本のGDPデータをプロットする¶

試しに日本のGDPデータをプロットしてみる。

df = df.loc['Japan']
df.head()

	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	…	2015	2016	2017	2018	2019	2020	2021	2022	2023	2024
Country
Japan	464134.700	471227.700	481616.900	489624.400	496577.00	504791.400	499271.400	472228.800	492023.400	491455.400	…	516932.500	520081.100	530111.900	534409.850	539630.510	542060.88	544974.610	547739.420	550621.780	553488.850
Japan	0.118	1.528	2.205	1.663	1.42	1.654	-1.094	-5.416	4.192	-0.115	…	1.223	0.609	1.929	0.811	0.977	0.45	0.538	0.507	0.526	0.521
Japan	515986.200	515400.700	520965.500	524132.900	526879.70	531688.200	520715.700	489501.000	500354.000	491408.400	…	531319.800	535986.500	545122.000	549016.200	557821.980	567275.01	572865.670	578293.760	584636.480	591508.090
Japan	4115.120	4445.660	4815.170	4755.410	4530.38	4515.260	5037.910	5231.380	5700.100	6157.460	…	4389.480	4926.670	4859.950	4971.930	5176.210	5495.42	5807.520	6133.860	6475.890	6848.810
Japan	3567.750	3689.540	3872.400	4059.400	4241.64	4427.640	4464.400	4254.780	4484.790	4573.190	…	5137.290	5225.070	5427.100	5594.450	5749.550	5896.04	6050.970	6207.510	6369.490	6536.650

5 rows × 23 columns

df1 = df[:1]
df1

	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	…	2015	2016	2017	2018	2019	2020	2021	2022	2023	2024
Country
Japan	464134.7	471227.7	481616.9	489624.4	496577.0	504791.4	499271.4	472228.8	492023.4	491455.4	…	516932.5	520081.1	530111.9	534409.85	539630.51	542060.88	544974.61	547739.42	550621.78	553488.85

1 rows × 23 columns

df1.T.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f98432dc390>

from matplotlib.pyplot import *
from matplotlib.font_manager import FontProperties
from matplotlib import rcParams
style.use('ggplot')

fig, ax = subplots()
df1.T.plot(style='.-',ms=20,figsize=(20,10),fontsize=18,ax=ax)
ax.legend(["GDP"],loc=4, prop={'size': 46});